SYNTHETIC GENES FOR ENHANCED EXPRESSION

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation-in-part of, and claims priority to, Application
No. 09/494,921, filed January 31, 2000.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made in part with government support under Grant
Nos. 1R43DK55951-01 and 1R43GM60822-01, awarded by the National Institutes of Health.
The government thus has certain rights in the invention.

BACKGROUND OF THE INVENTION

The field of the invention is synthetic nucleic acid sequences for improved amplification
and expression in a host organism, and methods of creating them.

It has been a goal of biotechnology to promote the expression of cloned genes for analysis
of gene structure and function and also for commercial-scale synthesis of desirable gene products.
DNA cloning methods have enabled the genetic modification of bacteria and unicellular
eukaryotes to produce heterologous gene products. In principle, the genes may originate from
almost any source, including other bacteria, animal cells or plant cells. Although the expression
of heterologous genes is a function of a variety of complex factors, methods for maximizing the
expression of cloned sequences have been under intense and rapid development. Plasmid and viral
vectors have been developed in both prokaryotes and eukaryotes that enhance the level of
expression of cloned genes. In some cases the vector itself contains the regulatory elements
controlling the expression of genes which are not normally expressed in the host cell so that a
high level of expression of heterologous genes can be obtained.

Several problems exist, however, in the expression of many proteins across phyla and
even across species. Post-translational handling and modification of expressed proteins by the
host cell often does not mimic that of the heterologous gene's own cell type. Frequently, even
if the protein is expressed in a useful form, heterologous genes are poorly expressed. Low yields
of expressed protein may make manufacture of commercially useful quantities impossible or
prohibitively costly. Vectors designed to enhance expression cannot overcome some expression
problems if the regulatory elements of the vector are not the constraint on robust expression;
in such cases, other cellular or translational constraints are at issue.

Genes encoding poorly expressed proteins are often themselves difficult to clone and
amplify as well. This can be due to secondary structure inherent in the gene, for example caused
by high G-C content. Some methods have been used to reduce these difficulties, such as the use
of DMSO or betaine to bring G-C and A-T melting behaviors more into alignment, or the use of
ammonium sulfate (hydrogen binding cations) to destabilize G-C bonding during PCR. The
problem with these methods is that the effects of the additives are concentration dependent, so
variations in template size and G-C content mean lengthy optimization procedures. Additionally,
these steps do nothing to facilitate subsequent expression of the nucleic acid once it has been
cloned.

The frequency of particular codon usage in Escherichia coli and other enteric bacteria has
long been known, and it has been hypothesized that replacement of certain rare codons encoding
a particular amino acid in a heterologous eukaryotic or prokaryotic gene with a codon that is
more commonly used by the selected host bacterium (or eukaryotic host cell) would enhance
expression (see, e.g., Kane, Curr Opin Biotechnol 6:494-500 (1995) and Zahn, J Bacteriol.,
178:2926-2933 (1996)). This is based on the theory that rare codons have only a few tRNAs per
cell and that translation of heterologous sequences having numerous occurrences of these rare
codons is limited by too few available tRNAs for those codons. However, simple replacement
of rare codons does not reliably improve expression of heterologous genes, and no broadly
applicable method exists to select which codon changes are best to increase expression of
heterologous sequences. Further, it is not known in detail how codon usage is related to
expression level.

Many gene products, often from bacteria, are commonly used as research and assay
reagents, and various microbial enzymes increasingly are finding applications as industrial
catalysts (see, for example, Rozzell, J.D., "Commercial Scale Biocatalysis: Myths and Realities,"
Bioorganic and Medicinal Chemistry, 7:2253-2261 (1999), herein incorporated by reference).
Some have substantial commercial value. Examples include heat-stable Taq polymerase from
Thermus aquaticus, restriction enzymes such as EcoR I from E. coli, lipase from Pseudomonas
cepacia, β-amylase from Bacillus sp., penicillin amidase from E. coli and Bacillus sp., glucose
isomerase from the genus Streptomyces, and dehalogenase from Pseudomonas putida. Genes
from bacteria may express easily in commercially useful host strains, but many do not. In
particular, genes from many bacteria have significantly different codon preferences from enteric
bacteria. For example, genes from filamentous bacteria such as streptomycetes and from various
strains of the genera Bacillus, Pseudomonas, and the like can be difficult to express abundantly
in enteric bacteria such as E. coli. An example of a Pseudomonas gene that is difficult to express
in E. coli is the gene encoding the enzyme methionine gamma-lyase, which is useful for the assay
of L-homocysteine and/or L-methionine as described in US Patent No. 5,885,767 (herein
incorporated by reference). This assay is particularly useful in the diagnosis and treatment of
homocystinuria, a serious genetic disorder characterized by an accumulation of elevated levels of
L-homocysteine, L-methionine and metabolites of L-homocysteine in the blood and urine.
Homocystinuria is more fully described in Mudd et al., "Disorders of transsulfuration," In:
Scriver et al., eds., The Metabolic and Molecular Basis of Inherited Disease, McGraw-Hill Co.,
New York, 7th Edition, 1995, pp. 1279-1327 (herein incorporated by reference). In developing an
assay for the accurate quantitation of L-homocysteine and L-methionine according to the methods
described in Patent No. 5,885,767, obtaining large amounts of methionine gamma-lyase is
necessary. However, this Pseudomonas gene contains a number of codons that are less commonly
found in genes of desirable bacterial hosts for expression such as E. coli.

Similarly, genes from other organisms, such as yeast or mammals, can have utility as
therapeutic agents, reagents, or catalysts. Examples include erythropoietin, human growth
hormone, and eukaryotic oxidoreductases such as amino acid dehydrogenases, disulfide
reductases, and alcohol dehydrogenases.

Because plasmid vectors designed to enhance expression with a variety of promoters or
other regulatory elements often do not resolve the difficulty in expressing certain genes, and
because no systematic approach exists for codon replacement to aid amplification of nucleic acids
or their expression, there is clearly a need for an improved method for amplification and
expression of genes, including genes from mammals and other animals, plants, yeast, fungi, and
various bacteria such as streptomycetes, Bacillus, Pseudomonas and the like introduced into
enteric bacterial hosts such as E. coli.

SUMMARY OF THE INVENTION 

In one embodiment, the invention is directed to a method of making a synthetic nucleic
acid sequence. The method comprises providing a starting nucleic acid sequence, which
optionally encodes an amino acid sequence, and determining the predicted ΔGfolding of the
sequence. The starting nucleic acid sequence can be a naturally occurring sequence or a
non-naturally occurring sequence. The starting nucleic acid sequence is modified by replacing at
least one codon from the starting nucleic acid sequence with a different corresponding codon to
provide a modified nucleic acid sequence. As used herein, "codon" generally refers to a
nucleotide triplet which codes for an amino acid or translational signal (e.g., a stop codon), but
can also mean a nucleotide triplet which does not encode an amino acid, as would be the case if
the synthetic or modified nucleic acid sequence does not encode a protein (e.g., upstream
regulatory elements, signaling sequences such as promoters, etc.). As used herein, a "different
corresponding codon" refers to a codon which does not have the identical nucleotide sequence,
but which encodes the identical amino acid. The predicted ΔGfolding of the modified nucleic acid
sequence is determined and compared with the ΔGfolding of the starting nucleic acid sequence. In
accordance with the invention, the predicted ΔGfolding of the starting nucleic acid sequence can be
determined before or after the modified nucleic acid sequence is provided.

Thereafter, it is determined whether the ΔGfolding of the modified nucleic acid sequence
is increased (i.e., made less negative) relative to the ΔGfolding of the starting nucleic acid
sequence by a desired amount, such as at least about 2%, at least about 10%, at least about 20%,
at least about 30%, or at least about 40%. If the ΔGfolding of the modified nucleic acid sequence
is not increased by the desired amount, the modified nucleic acid sequence is further modified
by replacing at least one codon from the modified nucleic acid sequence with a different
corresponding codon to provide a different modified nucleic acid sequence. These steps are
repeated until the ΔGfolding of the modified nucleic acid sequence is increased by the desired
amount to ultimately provide a final nucleic acid sequence, which is the desired nucleic acid
sequence.
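
The iterative procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the specification does not name a folding-energy predictor here, so toy_folding_energy is a hypothetical stand-in that merely penalizes short self-complementary stems in arbitrary units, and the greedy search order, the function names, and the example sequence are likewise invented for the sketch. A real design would substitute an established secondary-structure prediction program for the toy scorer.

```python
from itertools import product

# Standard genetic code in DNA letters, built in the usual T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for _codon, _aa in CODE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate an in-frame DNA sequence; '*' marks a stop codon."""
    return "".join(CODE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def toy_folding_energy(seq, stem=4, min_loop=3):
    """HYPOTHETICAL stand-in for a predicted folding free energy.

    Every potential hairpin stem of length `stem` lowers the score by
    3 per G-C pair and 2 per A-T pair. Units are arbitrary, not kcal/mol.
    """
    energy = 0.0
    for i in range(len(seq) - stem + 1):
        arm = seq[i:i + stem]
        partner = arm.translate(COMPLEMENT)[::-1]
        j = seq.find(partner, i + stem + min_loop)
        while j != -1:
            energy -= sum(3 if b in "GC" else 2 for b in arm)
            j = seq.find(partner, j + 1)
    return energy


def relative_increase(start_dg, modified_dg):
    """Fractional increase of the modified value over a (negative) start value."""
    return (modified_dg - start_dg) / abs(start_dg) if start_dg else 0.0


def reduce_structure(seq, target=0.20, max_rounds=50):
    """Greedily swap synonymous codons until the score rises by `target`."""
    start_dg = toy_folding_energy(seq)
    current = seq
    for _ in range(max_rounds):
        current_dg = toy_folding_energy(current)
        if relative_increase(start_dg, current_dg) >= target:
            break
        best, best_dg = current, current_dg
        for pos in range(0, len(current) - 2, 3):
            codon = current[pos:pos + 3]
            for alt in SYNONYMS[CODE[codon]]:
                if alt == codon:
                    continue
                candidate = current[:pos] + alt + current[pos + 3:]
                candidate_dg = toy_folding_energy(candidate)
                if candidate_dg > best_dg:
                    best, best_dg = candidate, candidate_dg
        if best == current:  # no single synonymous swap helps any further
            break
        current = best
    return current, start_dg, toy_folding_energy(current)


# Invented example sequence containing G-C rich self-complementary stretches.
design, start_dg, final_dg = reduce_structure("ATGGCCGGCGCCAAAGGCGCCGGCTAA")
```

Because only synonymous codons are exchanged, the encoded amino acid sequence is unchanged at every step; only the predicted structure score moves.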

In one embodiment, the invention is a synthetic polynucleotide designed by the methods 
of the invention. This includes a nucleic acid having the sequence of a polynucleotide designed 
by the methods or a sequence complementary thereto. 

In another embodiment, the invention includes a method of physically creating a tangible
synthetic polynucleotide comprising creating a physical embodiment of the synthetic
polynucleotide made using the nucleic acid/polynucleotide design methods of the invention, and
the physical embodiments of the tangible synthetic polynucleotide prepared by this method (i.e.,
physical embodiments of the synthetic sequences, and copies of such sequences created by other
methods).

The modified and/or final nucleic acid sequence can then be physically created. By the
present invention, a desired nucleic acid sequence can be created that is more highly expressed
in a selected host, such as E. coli, an insect cell, yeast, or a mammalian cell, than the starting
sequence. By "more highly expressed" is meant more protein product is produced by the same
host than would be with the starting sequence, preferably at least 5% more, more preferably at
least 10% more, and most preferably at least 20% more.

Preferably the codon replacement is in a region of the starting nucleic acid sequence or
modified nucleic acid sequence containing secondary structure. It is also preferred that the
different corresponding codon is one that occurs with higher frequency in the selected host. In
a particularly preferred embodiment, the desired amino acid sequence is expressed in Escherichia
coli, the amino acid sequence is from a bacterium of the genus Pseudomonas, and the
different corresponding codon is selected to be one that occurs with higher frequency in a selected
host, such as Escherichia coli, than does the replaced codon. Alternatively, or in addition, the
different corresponding codon is selected as one that has fewer guanine or cytosine residues than
the replaced codon.
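
As a minimal sketch of the two selection preferences just described (higher frequency in the host, and fewer guanine or cytosine residues), the following Python function chooses a replacement within a synonymous codon family. The per-1000 values for the Thr, Glu and Phe families are transcribed from Table 1; the function name, the tie-breaking rule, and the decision to combine both preferences in one key are assumptions made for illustration, not a rule stated in the specification.

```python
# Occurrences per 1000 codons, transcribed from Table 1 for three families.
USAGE_PER_1000 = {
    "Thr": {"ACG": 3.63, "ACA": 2.03, "ACU": 18.87, "ACC": 29.91},
    "Glu": {"GAG": 15.68, "GAA": 57.20},
    "Phe": {"UUU": 7.40, "UUC": 24.10},
}


def gc_count(codon):
    """Number of guanine or cytosine residues in a codon."""
    return sum(base in "GC" for base in codon)


def pick_replacement(amino_acid, codon, prefer_low_gc=True):
    """Return a synonymous codon used more often by the host, or None.

    Only codons with a higher host frequency than `codon` are considered.
    With prefer_low_gc=True the candidate with the fewest G/C residues is
    chosen (ties broken by higher frequency); otherwise the most frequent
    candidate is chosen. Both rules are illustrative assumptions.
    """
    family = USAGE_PER_1000[amino_acid]
    better = [c for c in family if family[c] > family[codon]]
    if not better:
        return None
    if prefer_low_gc:
        return min(better, key=lambda c: (gc_count(c), -family[c]))
    return max(better, key=lambda c: family[c])


# Replacing the rare Thr codon ACG under each preference.
low_gc_choice = pick_replacement("Thr", "ACG")
most_frequent_choice = pick_replacement("Thr", "ACG", prefer_low_gc=False)
```

The two preferences can disagree: for Thr, the most frequent codon carries more G/C than the second most frequent one, which is why the sketch exposes the choice as a parameter.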

In a particularly preferred embodiment, the starting nucleic acid sequence is derived, e.g.,
converted, from an amino acid sequence native to an organism different from the desired host for
expression, for example Pseudomonas.

The method of the invention also provides a modified, final sequence that is more
amplifiable than the starting sequence. In other words, the final sequence is amplified more
readily in a full length form, more rapidly or in greater quantity.

In another embodiment, the invention is directed to a synthetic nucleic acid sequence
having a plurality of codons and encoding a methionine gamma-lyase protein from Pseudomonas
putida. As used herein, the phrase "nucleic acid sequence encoding a protein" means that the
nucleic acid sequence encodes at least the functional domain of the protein. The sequence has
no more than about 95% homology, preferably no more than about 90% homology, more
preferably no more than about 85% homology, still more preferably no more than about 80%
homology, to a naturally occurring methionine gamma-lyase gene from Pseudomonas putida.
At least about 5%, preferably at least about 10%, more preferably at least about 20%, still more
preferably at least about 30%, even more preferably at least about 40%, of the codons in the
synthetic nucleic acid sequence are different from codons found in the naturally occurring gene.
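
The two thresholds in the preceding paragraph can be checked with a short calculation. The sketch below is illustrative only: it assumes the two sequences have equal length and are already aligned codon for codon (so it does not handle insertions), and it treats "homology" as simple percent nucleotide identity, which is an interpretation rather than a definition given in the text. The function names and example sequences are invented.

```python
def split_codons(seq):
    """Split an in-frame sequence into consecutive triplets."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_difference_percent(natural, synthetic):
    """Percentage of codon positions at which the two sequences differ."""
    a, b = split_codons(natural), split_codons(synthetic)
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal codon length")
    changed = sum(x != y for x, y in zip(a, b))
    return 100.0 * changed / len(a)


def nucleotide_identity_percent(natural, synthetic):
    """Percentage of identical nucleotides (ungapped, position by position)."""
    if not natural or len(natural) != len(synthetic):
        raise ValueError("sequences must be non-empty and of equal length")
    same = sum(x == y for x, y in zip(natural, synthetic))
    return 100.0 * same / len(natural)


# Invented four-codon example: two codons carry a silent third-base change.
natural = "ATGGCCGGCAAA"
synthetic = "ATGGCTGGTAAA"
changed_pct = codon_difference_percent(natural, synthetic)
identity_pct = nucleotide_identity_percent(natural, synthetic)
```

Note that a small number of nucleotide changes can alter a large share of codons, since a single base substitution makes the whole codon "different".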

In one aspect, the codons in the synthetic nucleic acid sequence encode the same amino
acids as the codons in the naturally occurring gene. In another aspect, at least one of the codons
in the synthetic nucleic acid sequence encodes an amino acid different from the numerically
corresponding amino acid found in the naturally occurring sequence. In yet another aspect, at
least one of the different codons in the synthetic nucleic acid sequence is in an area of secondary
structure in the naturally occurring gene.

In another embodiment, the invention is directed to synthetic genes derived from any
source, e.g., eukaryotic or prokaryotic, for improved expression in heterologous or homologous
expression hosts. The synthetic nucleic acid sequences of the invention are comprised of
non-naturally occurring polymers of nucleic acids, each sequence having a biological function
encoded by the sequence. The biological function can be direct (e.g., the nucleic acid sequence
possesses the function, as in a promoter, for example) or indirect (e.g., the nucleic acid serves as
a template to encode another molecule such as RNA or protein which has a function), and is
generally one that is known from a similar naturally occurring or synthetic sequence. However,
the biological function of the synthetic sequence created using the methods of the invention need
not be identical to a known or predicted biological function in a known starting sequence. For
example, the function may be enhanced in the synthetic sequence, or an enzyme may act on one
or more different substrates, use more or different co-factors, catalyze reactions at a different rate,
etc. The synthetic sequences further have no more than about 95% homology to a known starting
sequence, and have a different free energy of folding than does the starting sequence. Finally, the
synthetic sequences of the invention have the characteristic that they are better expressed (e.g.,
more highly expressed, expressed under different conditions, or expressed with more desired
characteristics) in a selected host cell than the starting sequence would be if expressed in the
selected host cell. The host cell is generally heterologous, but may be homologous for the starting
sequence (the artificial synthetic sequence, not being found in nature, has no homologous host).

In one aspect, the synthetic nucleic acid sequence comprises a plurality of codons which
encode amino acids and proteins. In preferred embodiments, the difference between the synthetic
sequence and the starting sequence is that the synthetic sequence has at least one codon which
is different from the starting sequence at the same amino acid position in the protein sequence.
This codon may encode a different amino acid, the same amino acid, insert or delete an amino
acid from that position, or encode a restriction site. Members of the oxidoreductase family are
disclosed, and all members of this family or sequences encoding oxidoreductase functionality are
among preferred sequences. Other preferred sequences include those encoding decarboxylase,
formate dehydrogenase, hydantoinase, and vanillyl alcohol oxidase functions. Any sequence
encoding a biological function from any source can be improved using the methods of the
invention for enhanced expression or functionality.

In another embodiment, the invention is directed to a method of creating a synthetic
nucleic acid. The method comprises providing a sense nucleic acid sequence having a 5' end and
a 3' end and providing an antisense nucleic acid sequence having a 5' end and a 3' end. Preferably
the sense and antisense nucleic acid sequences are between about 10 and about 200 bases, more
preferably between about 80 and about 120 bases. The 3' end of the sense sequence has a
plurality of bases complementary to a plurality of bases of the 3' end of the antisense sequence,
thereby forming an area of overlap. Preferably the area of overlap is at least 6 bases, more
preferably at least 10 bases, still more preferably at least 15 bases. The 5' end of the sense
sequence extends beyond the 3' end of the antisense sequence, and the 5' end of the antisense
sequence extends beyond the 3' end of the sense sequence. The method further comprises
annealing the sense and antisense sequences at the area of overlap. A polymerase and free
nucleotides are added to the sequences. Said nucleotides may be naturally occurring, i.e., A, T,
C, G, or U, or they may be non-natural, e.g., iso-cytosine, iso-guanine, xanthine, and the like.
The sequences can be annealed before or after addition of the polymerase and free nucleotides.
The sequences are extended, wherein the area of overlap serves to prime the extension of the
sense and antisense sequences in the 3' direction, forming a double stranded product. The
extended sequence can then be amplified. Further, a second step can be added to the method,
in which the double stranded first extension product is separated into an extended sense strand
and an extended antisense strand and a second set of sense and antisense nucleic acid sequences
is provided, each having a 5' end and a 3' end. Each has a plurality of bases on its 3' end
complementary to a plurality of bases on the 3' end of the extended sense or antisense strand,
respectively, thereby forming second and third areas of overlap. A polymerase and free
nucleotides are added to the sequences and separated strands, wherein the second and third areas
of overlap serve to prime a second extension of the sequences and strands that encompasses the
sequence of the first sense and antisense nucleic acid sequences and the second sense and
antisense nucleic acid sequences.
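
The geometry of the first extension step can be modelled in silico as below. This is a simplified sketch under assumptions: it considers only perfect Watson-Crick complementarity between the two 3' ends, ignores annealing temperature and polymerase behavior, and uses invented function names and an invented example sequence. The default minimum overlap of 6 bases mirrors the lower bound given in the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def extend_overlapping_oligos(sense, antisense, min_overlap=6):
    """Model mutual 3' priming of a sense and an antisense oligonucleotide.

    Both oligonucleotides are written 5' to 3'. The 3' end of `sense` must
    be complementary to the 3' end of `antisense`. Returns the top strand
    of the filled-in double stranded product and the overlap length.
    """
    antisense_as_top = reverse_complement(antisense)
    # The overlap must be shorter than both oligonucleotides so that each
    # 5' end extends beyond the 3' end of the other strand.
    longest = min(len(sense), len(antisense_as_top)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if sense[-k:] == antisense_as_top[:k]:
            return sense + antisense_as_top[k:], k
    raise ValueError("3' ends are not complementary over the minimum overlap")


# Invented 33-base target split into two oligonucleotides sharing 10 bases.
target = "ATGGCTAGCAAAGGTGAACTGACCGGTCTGTAA"
sense_oligo = target[:21]
antisense_oligo = reverse_complement(target[11:])
top_strand, overlap = extend_overlapping_oligos(sense_oligo, antisense_oligo)
```

The second extension round described above would apply the same overlap test between each extended strand and the new flanking oligonucleotide.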


DESCRIPTION OF THE DRAWINGS 

These and other features and advantages of the present invention will be better understood
by reference to the following detailed description when considered in conjunction with the
accompanying figures wherein:

FIG. 1A: DNA sequence of synthetic mdeA gene (1200 bps with GGT insertion), called
synmdeA. Nco I and BamH I cloning sites are engineered at the 5' end and 3' end. The bold face
uppercase nucleotides are the changed nucleotides from the original mdeA gene sequence.

FIG. 1B: First DNA segment, mdeA1 (426 bps), with Nco I and Pst I cloning sites.

FIG. 1C: Second segment, mdeA2 (414 bps), with Pst I and EcoR I cloning sites.

FIG. 1D: Third segment, mdeA3 (367 bps), with EcoR I and BamH I cloning sites.

FIG. 2A: First round of amplification using long oligonucleotides to generate template
(tpA1, tpA2, or tpA3) DNA for each of the three synmdeA segments mdeA1, mdeA2 or mdeA3.
PCR amplification relies on overlapping sections of each oligonucleotide, which serve to prime
the extension of the neighboring segment.

FIG. 2B: Second round of amplification using the two short oligonucleotides to amplify
the full-length segments, mdeA1, mdeA2 or mdeA3. The short oligonucleotides overlap with the
5' ends of the sense and antisense strands to form a template primed by the tpA1, tpA2, or tpA3
strands, resulting in the filling in of both 5' and 3' ends of mdeA1, mdeA2 and mdeA3 after the
second round of PCR.

FIG. 3 is a schematic of the cloning strategy for mdeA1, mdeA2 and mdeA3 into cloning
and expression vectors. The amplified segments are ligated into the multiple cloning site of the
illustrated vector in the top row, then E. coli are transformed with the plasmids. Individual
plasmids containing each segment are selected in the second row, and the plasmids are
double-digested to extract the insert, which is then ligated into an expression vector as shown in
the last row.

FIG. 4A is a gel showing expression of a synthetic P. putida methionine gamma lyase
synmdeA in BL21/pTM vector prior to and after induction with IPTG. All cultures were grown
at 37°C. synmdeA was cloned into pET15b (available from Novagen) under the control of the T7
RNA polymerase promoter. Lanes are: M - prestained protein molecular weight standards, high
range, as indicated on the figure; 1 and 2 - three hours induction with 0.1 mM IPTG; 3 - three
hours induction with 0.5 mM IPTG; 4 - three hours induction with 1 mM IPTG; 5 - three hours
induction with 2 mM IPTG; 6 - not induced.

FIG. 4B is a gel showing the poor expression of native P. putida methionine gamma lyase
(mdeA) in pSIT vector prior to and after induction with IPTG. All cultures were grown at 37°C.
The induced samples contain extra bands at about 28 kD due to premature termination of mdeA
translation. Native mdeA was cloned into the pSIT vector under the control of the T7 RNA
polymerase promoter. Lanes are: M - prestained protein molecular weight standards, high range,
as indicated on the figure; 1 - not induced; 2 and 3 - three hours induction with 0.5 mM IPTG;
4 and 5 - three hours induction with 1 mM IPTG.

FIG. 5 shows expression in E. coli of two genes with very different ΔGfolding: naphthalene
dioxygenase (NDO) from Pseudomonas putida (ΔG = -256.1 kcal/mol) and methionine gamma
lyase (mgl1) from T. vaginalis (ΔG = -152.5 kcal/mol). Lanes 1-4 are NDO products, and 5-9
are MGL1 products. Lanes are as follows: M1 - multimark multi-colored standard; M2 -
prestained protein molecular weight standards; 1 - not induced; 2 - three hours induction with
0.02% L-arabinose; 3 - three hours induction with 0.04% L-arabinose; 4 - three hours induction
with 0.08% L-arabinose; 5 - not induced; 6 - three hours induction with 0.02% L-arabinose; 7 -
three hours induction with 0.04% L-arabinose; 8 - three hours induction with 0.08% L-arabinose;
9 - three hours induction with 0.10% L-arabinose. Both genes were cloned into the pBAD vector.
Cells were grown at 37°C. Expression of mgl1, having a less negative ΔGfolding, was superior to
NDO expression.

FIG. 6 is a gel showing expression of native and synthetic genes developed using the
methods of the invention. Lane 1 is a negative control (empty pBAD vector); Lanes 2 and 3 show
expression of synthetic aldehyde reductase 2 containing an A25 to 025 mutation (synALR2mut)
induced at 30°C and 37°C, respectively; Lanes 4 and 5 show expression of native yeast putative
reductase 1 (YPR1) induced at 30°C and 37°C, respectively; Lanes 6 and 7 show the synthetic
version, synYPR1, induced at 30°C and 37°C, respectively; and Lanes 8 and 9 show expression
of synthetic aldehyde reductase 1 (synALR1) induced at 30°C and 37°C, respectively. All
sequences except synALR1 were induced for 3 hours with L-arabinose. synALR1 was cloned
into a different vector and induced for 3 hours with IPTG.

FIG. 7 is a gel comparing expression of native and synthetic formate dehydrogenase
(Fdh1.2 and synFdh, respectively) induced with L-arabinose for 3 hours at 30°C and 37°C, and
uninduced. Lane 1 is Fdh1.2 induced at 30°C; Lane 2 is synFdh at 30°C; Lane 3 is Fdh1.2 at
37°C; Lane 4 is synFdh at 37°C; and Lane 5 is uninduced Fdh1.2.

FIG. 8 graphically represents enzyme activity of synthetic formate dehydrogenase
(synFdh) created using the methods of the invention (open triangles) as compared to native
Fdh1.2 (open squares, induced; open circles, uninduced), using an assay to catalyze the oxidation
of formate in the presence of NAD+.

DETAILED DESCRIPTION OF THE INVENTION 

In one embodiment, the invention is directed to developing nucleic acid sequences that
enhance expression of the encoded protein in a heterologous host. The frequency of particular
codon usage for E. coli and other enteric bacteria is shown in Table 1, below. This table is
derived from the 2000 Novagen catalog, page 196, available online at
http://www.novagen.com/html/catfram.html, herein incorporated by reference. However, the
information in this table does not tell one of skill in molecular biology which codons should be
replaced to enhance expression, if indeed any replacements will enhance expression.
Considerations other than simple codon replacement are clearly important. It has been discovered
that the composition of the full gene (or fragment to be expressed) is more important than a
particular codon exchange, and heterologous expression can be enhanced by replacement of
codons in the sequence's open reading frame alone, independent of promoters or other regulatory
sequence.



Table 1

aa     Codon   /1000(1)   Fraction(2)      aa     Codon   /1000(1)   Fraction(2)

Gly    GGG       1.89       0.02           Trp    UGG       7.98       1.00
Gly    GGA       0.44       0.00           stop   UGA       0.00       (stop)
Gly    GGU      52.99       0.59           Cys    UGU       3.19       0.49
Gly    GGC      34.55       0.38           Cys    UGC       3.34       0.51
Glu    GAG      15.68       0.22           stop   UAG       0.00       (stop)
Glu    GAA      57.20       0.78           stop   UAA       0.00       (stop)
Asp    GAU      21.63       0.33           Tyr    UAU       7.40       0.25
Asp    GAC      43.26       0.67           Tyr    UAC      22.79       0.75
Val    GUG      13.50       0.16           Leu    UUG       2.61       0.03
Val    GUA      21.20       0.26           Leu    UUA       1.74       0.02
Val    GUU      43.26       0.51           Phe    UUU       7.40       0.24
Val    GUC       5.52       0.07           Phe    UUC      24.10       0.76
Ala    GCG      23.37       0.26           Ser    UCG       2.03       0.04
Ala    GCA      25.12       0.28           Ser    UCA       1.02       0.02
Ala    GCU      30.78       0.35           Ser    UCU      17.42       0.34
Ala    GCC       9.00       0.10           Ser    UCC      19.02       0.37
Arg    AGG       0.15       0.00           Arg    CGG       0.15       0.00
Arg    AGA       0.00       0.00           Arg    CGA       0.29       0.01
Ser    AGU       1.31       0.03           Arg    CGU      42.10       0.74
Ser    AGC      10.31       0.20           Arg    CGC      13.94       0.25
Lys    AAG      16.11       0.26           Gln    CAG      33.83       0.86
Lys    AAA      46.46       0.74           Gln    CAA       5.37       0.14
Asn    AAU       2.76       0.06           His    CAU       2.61       0.17
Asn    AAC      39.78       0.94           His    CAC      12.34       0.83
Met    AUG      24.68       1.00           Leu    CUG      69.69       0.83
Ile    AUA       0.15       0.00           Leu    CUA       0.29       0.00
Ile    AUU      10.16       0.17           Leu    CUU       3.63       0.04
Ile    AUC      50.09       0.83           Leu    CUC       5.52       0.07
Thr    ACG       3.63       0.07           Pro    CCG      27.58       0.77
Thr    ACA       2.03       0.04           Pro    CCA       5.23       0.15
Thr    ACU      18.87       0.35           Pro    CCU       2.76       0.08
Thr    ACC      29.91       0.55           Pro    CCC       0.15       0.00

(1) Expected number of occurrences per 1000 codons in enteric bacterial genes whose codon usage
is identical to that compiled in the frequency table.

(2) Fraction of occurrences of the codon in its synonymous codon family.
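
The relationship between the two numeric columns of Table 1 can be verified with a short calculation: the Fraction for a codon is its per-1000 value divided by the sum of the per-1000 values across its synonymous family. The Python sketch below recomputes the fractions for the Thr and Glu families using values transcribed from the table; the function name is illustrative, and rounding to two decimals is an assumption chosen to match the table's precision.

```python
# Per-1000 values transcribed from Table 1 for two synonymous codon families.
PER_1000 = {
    "Thr": {"ACG": 3.63, "ACA": 2.03, "ACU": 18.87, "ACC": 29.91},
    "Glu": {"GAG": 15.68, "GAA": 57.20},
}


def family_fractions(per_1000):
    """Divide each codon's per-1000 count by the total for its family."""
    total = sum(per_1000.values())
    return {codon: round(value / total, 2) for codon, value in per_1000.items()}


thr_fractions = family_fractions(PER_1000["Thr"])
glu_fractions = family_fractions(PER_1000["Glu"])
```

For Thr the family total is 54.44 per 1000, so ACC accounts for 29.91 / 54.44, or about 0.55, in agreement with the tabulated fraction.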



The present invention encompasses highly amplifiable, expressible oligonucleotides,
polynucleotides, and/or genes and is directed to methods of designing and physically creating
these nucleic acid sequences. In one embodiment, the present invention is directed to a method
of designing and physically creating genes that express well when introduced into heterologous
expression hosts, such as from eukaryotic sources into prokaryotic hosts, e.g., common enteric
bacterial host microorganisms such as E. coli. The invention allows expression of genes from
various organisms, such as mammals and other animals, plants, yeast, fungi, and bacteria (e.g.,
pigs, Saccharomyces, streptomycetes, Bacillus, Pseudomonas and the like) in prokaryotic hosts
such as E. coli and eukaryotic hosts at commercially viable levels, even for proteins with
typically low yields, such as methionine gamma-lyase from P. putida. As used herein, the terms
"polypeptide," "protein" and "amino acid sequence" are used interchangeably and mean
oligomeric polyamides of at least two amino acids, whether or not they encompass the full-length
polypeptide encoded by a gene or merely a portion of it. "Heterologous" indicates that the
sequence is not native to the host used or identical to a sequence which naturally occurs in the
host used, or refers to a host which is not the natural source of a nucleic acid or peptide sequence.
"Designing" means conceiving a sequence of nucleotides in a form that can be written or printed.
Such sequence may correspond to the coding region of an entire gene, or only a portion of it, and
may also include additional bases added at a particular location or position, for example to create
desired restriction sites or to insert mutations to enhance the protein's function. "Physically
creating" means preparing a chemical entity such as an oligonucleotide/polynucleotide or
polypeptide, whether by synthesis by chemical and/or enzymatic methods, biosynthesis, a
combination of synthesis and PCR, or by any other methods known in the art. "PCR" means
polymerase chain reaction.

In the present invention, the sequence of a gene is modified to enhance its ability to be
amplified, for example by PCR methods, and/or to improve its expression in a selected host, for
example, an enteric bacterium such as E. coli. This is achieved by designing a nucleotide
sequence preferably using codons preferred by the host, calculating the ΔGfolding of the nucleic
acid sequence (the amount of energy required for or released by folding in solution, in kcal/mole),
modifying the sequence by replacing one or more codons in the sequence in one or more areas
of predicted secondary structure with less preferred codons to reduce predicted secondary
structure, and recalculating the ΔGfolding of the modified nucleic acid sequence. The replacement
of codons and recalculation of the free energy of folding may be repeated as many times as
desired. One, some, or all codons encoding a particular amino acid may be replaced in the region
of secondary structure, or throughout the entire coding region of the sequence. The result is a
modified final nucleic acid sequence, for example a synthetic gene encoding a desired complete
or partial protein, whether a mutant protein or one having the desired structural and functional
attributes of a native protein. The final synthetic sequence may be optimized for only a single
selected host, but the methods of the invention are readily operable for a starting sequence from
any source for expression in any selected host, whether animal, plant, fungal, prokaryotic, etc.

As used herein, the term "synthetic" gene, nucleic acid, oligonucleotide, polynucleotide,
primer, or the like means a nucleic acid sequence that is not found in nature; in other words, not
merely a heterologous sequence to a particular organism, but one which is heterologous in the
sense that it has been designed and/or created in a laboratory, and is altered in some way, and that
it does not have exactly the nucleotide (or possibly amino acid) sequence that its naturally
occurring source, template, or homolog has. A synthetic nucleic acid or amino acid sequence as
used herein can refer to a theoretical sequence or a tangibly, physically created embodiment. It
is intended that synthetic sequences designed by the method be included in the invention in any
form, e.g., paper or computer readable ("theoretical"), and physically created nucleic acids or
proteins. Physically created nucleic acids and proteins of the invention are part of the invention,
whether derived directly from the designed sequence, or copies of such sequences (e.g., made by
PCR, plasmid replication, chemical synthesis, and the like). The term "synthetic nucleic acid"
can include, for example, nucleic acid sequences derived or designed from wholly artificial amino
acid sequences, or nucleic acid sequences with single or multiple nucleotide changes as compared
to the naturally occurring sequence, those created by random or directed mutagenesis, chemical
synthesis, DNA shuffling methods, DNA reassembly methods, or by any means known to one
of skill in the art (see e.g., techniques described in Sambrook and Russell, "Molecular Cloning:
A Laboratory Manual," 3rd Ed., Cold Spring Harbor Laboratory Press (2001), herein incorporated
by reference). Such alterations can be done without changing the amino acid sequence encoded
by the nucleic acid sequence, or can modify the amino acid sequence to leave a desired function
of the encoded protein unaltered or enhanced. As used herein, "nucleic acid" means a naturally
occurring or synthetic nucleic acid, which can be composed of natural or synthetic nitrogen bases,
a deoxyribose or ribose sugar, and a phosphate group.

"Secondary structure" refers to regions of a nucleic acid sequence that, when single 
stranded, have a tendency to form double-stranded hairpin structures or loops. Such structures 
impede transcription (or amplification in vitro) and translation of affected regions in the nucleic 
acid sequence. Nucleic acids can be evaluated for their likely secondary structure by calculating 
the predicted ΔGfolding of each possible structure that could be formed in a particular strand of
nucleic acid. Energy must be released overall to form a base-paired structure, and a structure's
stability is determined by the amount of energy it releases. The more negative the ΔGfolding (i.e.,
the lower the free energy), the more stable that structure is and the more likely the formation of 
that double-stranded structure. 

Computer programs exist that can predict the secondary structure of a nucleic acid by 
calculating its free energy of folding. One example is the mfold program, which can be found at 
http://mfold2.wustl.edu/~mfold/dna/form1.cgi (using free energies derived from SantaLucia, Proc.
Natl. Acad. Sci. USA 95:1460-1465 (1998); see also Zuker, Science 244:48-52 (1989); Jaeger
et al., Proc. Natl. Acad. Sci. USA, Biochemistry, 86:7706-7710 (1989); Jaeger et al., Predicting
Optimal and Suboptimal Secondary Structure for RNA, in "Molecular Evolution: Computer
Analysis of Protein and Nucleic Acid Sequences", R. F. Doolittle ed., Methods in Enzymology,
183, 281-306 (1989); all herein incorporated by reference). Another example of such a computer
program is the Vienna RNA Package, available at http://www.ks.uiuc.edu/~ivo/RNA/, which
predicts secondary structure by using two kinds of dynamic programming algorithms: the
minimum free energy algorithm of Zuker and Stiegler (Nucl. Acids Res. 9:133-148 (1981)) and
the partition function algorithm of McCaskill (Biopolymers 29, 1105-1119 (1990)). Distances
(dissimilarities) between secondary structures can be computed using either string alignment or 
tree-editing (Shapiro & Zhang 1990). Finally, an algorithm is provided to design sequences with 
a predefined structure (inverse folding). 

Modifications to reduce secondary structure in DNA sequences by altering codon usage 
can be made in several ways. As used herein, "replacing codons" or "altering codon usage" 
means altering at least one of the nucleotides making up the three nucleotides of the codon triplet. 
It is understood that this change can occur at a "wobble" position to leave the amino acid encoded 
unchanged, or at another position or to a base that results in a change in the encoded amino acid. 
For example, the codon changes can be designed to swap out codons for a particular amino acid 
in the sequence (e.g., at a designated position in the sequence) which are not common in the
selected host (following e.g., Kane, supra, or Zahn, supra). Further, codons can be replaced to
reduce the G-C content of the naturally occurring codon. 

The inventive methods of the present invention produce sequences with superior
expression characteristics because they take more than one variable into account. The methods
involve designing a nucleic acid sequence based on a desired amino acid sequence using the
codons most commonly used for each amino acid in the chosen host organism (of course, an
additional step of analyzing the ΔGfolding of a native sequence may be performed as well). Next,
the predicted free energy of folding for the designed sequence is calculated using a computer
program as described previously. The program mfold is used in the Examples provided herein,
although any similar program may be used in the practice of this invention. In calculating the
predicted ΔGfolding, the full-length nucleotide sequence can be analyzed as a single entity, or the
full-length sequence can be divided into shorter segments and the predicted ΔGfolding for each
segment can be calculated separately, and then added together.

After the predicted ΔGfolding is calculated, changes to the sequence are made to try to
reduce the formation of secondary structure. Regions of predicted secondary structure are
identified using, for example, one of the computer programs previously described, and changes
are made in codons in these identified regions. Preferably, codon changes are selected to favor
more frequently occurring codons in the host organism selected to express the synthetic gene.
Thus, one or more codons in regions of predicted high secondary structure are changed to the
second or third most commonly used codon choice for the chosen host organism, and the
predicted ΔGfolding is recalculated. This process of codon changes and recalculation of the
predicted ΔGfolding is repeated until the predicted ΔGfolding of the sequence examined (e.g., the
entire sequence or a portion) is increased (made less negative) by greater than about 2%,
preferably greater than about 10%, more preferably greater than about 30%, as calculated by
ΔGfolding/(number of bases in the sequence analyzed). The starting sequence for the step of
designing a sequence (e.g., the naturally occurring sequence) is set as 100%. It is likely that the
change in ΔGfolding between the starting sequence and the final product will be smaller when the
starting sequence is a completely synthetic sequence based solely on preferred codon usage than
when the starting sequence is a naturally occurring sequence from a heterologous organism.
ΔGfolding for segments analyzed separately can be added to arrive at a ΔGfolding for the entire
sequence, or the ΔGfolding for the entire sequence can be determined in a single calculation. Once
the ΔGfolding for the entire sequence has been so determined, it is divided by the sequence length
in bases to arrive at a uniform measure of ΔGfolding for comparison of sequences of unequal length.
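
The cycle of codon replacement and recalculation described above may be sketched in Python as follows. This is a minimal illustration only and is not the procedure used in the Examples: the function names, the greedy acceptance rule, and the toy energy function are assumptions introduced here. In practice the fold_energy argument would wrap a folding program such as mfold, and the list of structured positions would come from that program's predicted structure.

```python
from typing import Callable, Dict, List, Sequence


def percent_increase(dg_start: float, dg_new: float) -> float:
    """Percent by which dg_new is less negative than dg_start (start = 100%)."""
    return 100.0 * (dg_new - dg_start) / abs(dg_start)


def reduce_structure(
    codons: List[str],
    alternatives: Dict[str, List[str]],
    structured_positions: Sequence[int],
    fold_energy: Callable[[str], float],
    target_percent: float = 2.0,
) -> List[str]:
    """Greedy sketch: try alternative (e.g., second or third choice) codons at
    positions of predicted secondary structure, keep a change only if the
    predicted folding energy becomes less negative, and stop once the target
    percent increase is reached. Sequence length is unchanged, so the percent
    change in total energy equals the percent change in energy per base."""
    start_dg = fold_energy("".join(codons))
    if start_dg >= 0:
        return list(codons)  # no predicted structure to reduce
    best, best_dg = list(codons), start_dg
    for pos in structured_positions:
        for alt in alternatives.get(best[pos], []):
            trial = best[:pos] + [alt] + best[pos + 1:]
            trial_dg = fold_energy("".join(trial))
            if trial_dg > best_dg:  # less negative: keep the change
                best, best_dg = trial, trial_dg
                break
        if percent_increase(start_dg, best_dg) >= target_percent:
            break
    return best


def toy_energy(seq: str) -> float:
    """Placeholder only (hypothetical): -1 per G or C. NOT a thermodynamic
    model; substitute a real folding program in any actual use."""
    return -float(sum(base in "GC" for base in seq))


if __name__ == "__main__":
    print(reduce_structure(["GGC", "GGC"], {"GGC": ["GGT"]}, [0, 1], toy_energy))
```

The target_percent argument corresponds to the thresholds of about 2%, 10%, or 30% stated above; the alternatives table is supplied by the user for the chosen host.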

It is also possible that a synthetic sequence may have a more negative ΔGfolding than its
counterpart native sequence. This condition may occur when codon choices must be made to
accommodate a particular expression host, or when the native sequence has very little secondary
structure to begin with. Preferably, this situation occurs in cases where the native sequence does
not have a great deal of secondary structure (see, e.g., Example 9 and Fdh2.1 and SynFdh).
Regardless, in such cases, the difference in ΔG/base between the native sequence and the more
negative synthetic sequence is preferably less than 0.1 kcal/(mol)(base), more preferably less than
0.05 kcal/(mol)(base), and most preferably less than about 0.03 kcal/(mol)(base).

Several variants can be analyzed to illustrate the advantages of the inventive method,
summarized in Table 2 below. A naturally occurring (native) mdeA gene from P. putida (SEQ
ID NO. 1) was used as the starting sequence, and its ΔGfolding was calculated (all ΔGfolding results
reported herein were carried out assuming a temperature of 37°C, Na+ = 1 M, and Mg2+ = 0) and
set at 100%. This sequence was modified by replacing rare arginine codons (termed "repmdeA";
modifications derived from Zahn, supra) with one found most commonly in E. coli (SEQ ID NO.
28). The change in ΔGfolding/base from this replacement was 1.9%. A more significant alteration
of mdeA was performed by replacing all of the rare codons mentioned in Kane, supra. This
sequence was made by exchanging agg, aga, and cga codons with cgt (arginine), cta codons with
ctg (leucine), ata with atc (isoleucine), and ccc with ccg (proline) (termed "raremdeA"; SEQ ID
NO. 29). As seen in Table 2, below, this exchange also did not significantly impact the ΔG of
the sequence, resulting in a change in ΔGfolding/base of only 1% as compared to the native
sequence. Simply replacing a rare codon does not necessarily increase ΔGfolding, and in fact, could
lower ΔGfolding, creating or failing to resolve problems in transcription or translation, or in
amplification by PCR methods.
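
By way of illustration, the rare-codon exchange described above for raremdeA can be written as a simple in-frame substitution. The following Python sketch uses only the codon exchanges recited in the preceding paragraph; the function and variable names are hypothetical, and the short test sequence is an invented example rather than any portion of SEQ ID NO. 1.

```python
# Codon exchanges recited above for "raremdeA" (after Kane, supra).
RARE_TO_COMMON = {
    "AGG": "CGT", "AGA": "CGT", "CGA": "CGT",  # arginine
    "CTA": "CTG",                              # leucine
    "ATA": "ATC",                              # isoleucine
    "CCC": "CCG",                              # proline
}


def replace_rare_codons(orf: str) -> str:
    """Replace rare codons in an open reading frame, reading in frame from
    the first base. Codons not listed in RARE_TO_COMMON are left unchanged."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("open reading frame length must be a multiple of 3")
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    return "".join(RARE_TO_COMMON.get(codon, codon) for codon in codons)


if __name__ == "__main__":
    # Invented example: ATG AGG CTA ATA CCC TAA
    print(replace_rare_codons("ATGAGGCTAATACCCTAA"))
```

Because the exchange is made codon by codon in the reading frame, a triplet such as agg that spans two adjacent codons is not altered. As noted above, such an exchange alone does not necessarily reduce predicted secondary structure.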

Because the codons known in the art to be rare and potentially to have an impact on
expression did not significantly improve the ΔGfolding of the sequence, all codons of mdeA's open
reading frame were exchanged for the most common codons in enteric bacteria from Table 1,
above (a sequence termed "optmdeA"; SEQ ID NO. 30). The ΔGfolding of this sequence was
increased 31.8% by this change compared to mdeA, a significant improvement. The sequence
optmdeA was then analyzed for regions of predicted secondary structure, and replacements
of codons in areas of high secondary structure were made to generate the designed sequence
synmdeA (SEQ ID NO. 3). The predicted ΔGfolding was recalculated for this sequence, and a
superior sequence with a greatly improved ΔGfolding was created. In this case, ΔGfolding was
increased (made less negative) by 40.7% compared to the starting native sequence. Thus, it is
clear that the inventive methods of developing the synthetic sequences go well beyond any
suggestions in the art pertaining to codon exchange.



Table 2

Sequence              ΔG (kcal/mol)    ΔG/base    % Change in ΔG
mdeA (1197 bp)        -256.6           -0.214     0%
repmdeA (1197 bp)     -251.8           -0.210     1.9%
raremdeA (1197 bp)    -254.0           -0.212     1.0%
optmdeA (1200 bp)     -175.5           -0.146     31.8%
synmdeA (1200 bp)     -152.5           -0.127     40.7%
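
The ΔG/base and percent change columns of Table 2 follow from the sequence lengths and total ΔG values by the normalization described earlier. The short Python sketch below recomputes them from the tabulated lengths and energies; the function names are hypothetical and are introduced only for this illustration.

```python
# (name, length in bases, total predicted folding energy in kcal/mol), from Table 2
TABLE_2 = [
    ("mdeA", 1197, -256.6),
    ("repmdeA", 1197, -251.8),
    ("raremdeA", 1197, -254.0),
    ("optmdeA", 1200, -175.5),
    ("synmdeA", 1200, -152.5),
]


def dg_per_base(dg_kcal_per_mol: float, length_bases: int) -> float:
    """Normalize the folding energy by sequence length."""
    return dg_kcal_per_mol / length_bases


def percent_change(native_per_base: float, variant_per_base: float) -> float:
    """Percent by which the variant is less negative than the native sequence."""
    return 100.0 * (native_per_base - variant_per_base) / native_per_base


def summarize(rows):
    """Return (name, energy per base, percent change), the first row being native."""
    native = dg_per_base(rows[0][2], rows[0][1])
    out = []
    for name, length, dg in rows:
        per_base = dg_per_base(dg, length)
        out.append((name, round(per_base, 3), round(percent_change(native, per_base), 1)))
    return out


if __name__ == "__main__":
    for row in summarize(TABLE_2):
        print(row)
```

The percentages are computed here from the unrounded ΔG/base values and rounded only for display.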



The method described herein of formulating synthetic sequences for improved expression
can be used for any nucleic acid sequence, even those being expressed in homologous hosts, or
with relatively little predicted secondary structure. Most commonly, however, the need to
improve expression will arise when expressing proteins in heterologous hosts. Regardless, any
starting sequence, preferably with a ΔGfolding/base of about -0.05 kcal/(mole)(base) or less, and
more preferably with a ΔGfolding/base of about -0.15 kcal/(mole)(base) or less, and most preferably
with a ΔGfolding/base of -0.2 kcal/(mole)(base) or less can be improved for better expression using
the methods of the invention. When a ΔGfolding less than about -0.20 kcal/(mole)(base) or an
increase of at least about 2% from the starting sequence is reached, the actual sequence of the
synthetic DNA can be physically created. Such physical creation of the designed oligonucleotide
sequence can be accomplished by any of the methods known in the prior art, for example by
oligonucleotide synthesis, or by the nucleic acid synthesis methods of the invention (described
more fully below).

Additionally, the invention takes advantage of the improved secondary structure
characteristics of the synthetic nucleic acid for enhanced amplification capability, for example
using PCR methods. Some of the same features of native nucleic acid sequences that make them
difficult to express in heterologous hosts may also make them difficult to clone or amplify. High
secondary structure in one or more regions of the nucleic acid can make cloning or PCR difficult
or impossible to perform on the intact nucleic acid or even on segments of the nucleic acid.
However, using the methods of the invention to reduce the secondary structure, the resulting
nucleic acid templates have better properties for polymerization and amplification. Making
synthetic nucleic acids that amplify easily has important ramifications for common molecular
biology procedures such as site directed mutagenesis. For example, using the methods of the
invention, a nucleic acid sequence encoding a particular protein (a native protein, a protein with
one or more desired mutations, or a completely artificial protein) can be designed using codons
used more commonly in a desired expression host cell, and the predicted ΔGfolding may then be
optimized as described herein. Regardless of the features of the polymerase, or any particular
weaknesses it may have (e.g., poor processivity), the probability of accurate full length synthesis
of the copy strand from the template is improved using the synthetic nucleic acid of the invention
because the regions of secondary structure have been reduced. Codons are replaced overall to
minimize ΔGfolding kcal/(mole)(base), but in specific locations also to alter the amino acid
sequence encoded by the nucleotide sequence, resulting in a nucleic acid sequence encoding a
particular protein with improved amplification and expression properties.

In one embodiment of this invention, the design and preparation of synthetic genes are
used in application of directed evolution, gene shuffling and molecular breeding methods.
Examples of gene shuffling and molecular breeding are described in US Patent 5,605,793, US
Patent 5,811,238, US Patent 5,830,721, US Patent 5,837,458, US Patent 5,965,408, US Patent
5,958,672, US Patent 6,001,574, all herein incorporated by reference. Genes to be shuffled or
recombined are designed and/or physically created based on the incorporation of preferred codons
as described in the present invention. Such synthetic genes can also be created with greater
homology, improving the reassembly of fragments in gene reassembly and shuffling methods.
The advantage of the use of genes designed and physically created as described herein is the
improved formation and expression of the shuffled or recombined genes. Such improved
expression facilitates screening by providing higher levels of the gene products that are to be
detected. The time required for screening can be reduced, or certain enzymatic activities can be
detected more easily. Improvements in gene products, whether enzymes or metabolites produced
by the actions of two or more different proteins derived through molecular breeding or directed
evolution methods, can be detected more readily. Genes designed and produced according to the
methods of the present invention can also be incorporated into kits for screening or other
purposes. An example of an enzyme screening kit is found in US Patent 6,004,788, herein
incorporated by reference.

Another embodiment of the invention, illustrated in the examples below, involves an
improved method of synthesizing a nucleic acid. Usual methods of synthesizing a desired nucleic
acid sequence which is not found in nature involve difficult and expensive chemical synthesis.
The synthesis method of the invention to create a synthetic sequence involves an amplification
method, such as PCR, using synthesized oligonucleotides designed to be overlapping, having as
many adjacent sense and antisense strands as desired or required to complete the synthetic gene
of choice. The oligonucleotides serve as both the template and primer in this PCR-based
synthesis strategy.

The examples described herein demonstrate one implementation of the method for the
physical creation of a synthetic gene. Two rounds of PCR reactions were carried out on three
segments of the synmdeA gene, and six oligonucleotides per segment were used to construct the
synthetic gene. The segments were ligated, amplified, excised, and inserted into an expression
vector. The first round of amplification involved creating four long oligonucleotides (around 100
bps) based on the synthetic sequence. These long oligonucleotides were used to generate
template DNA for various segments of the sequence. Longer synthetic sequences are best broken
into shorter segments in this method for easier amplification. The first round PCR amplification
relies on overlapping sections of each long oligonucleotide, to create areas of overlap. The areas
of overlap serve to prime the extension of the neighboring segment. The areas of overlap can be
any length that is sufficient for specificity and long enough for polymerase
recognition/attachment, preferably at least 10 bases and more preferably at least 15 bases of
overlap.

The second round of amplification used two short oligonucleotides (each about 30 
nucleotides) to amplify the full-length segments. The short oligonucleotides overlap the 5' ends 
of the sense and antisense strands from the previous round to form a template of each segment 
primed by the first round strands, resulting in the filling in of both 5' and 3' ends after the second 
round of PCR. The segments derived from this two-round PCR are ligated together to form the
unitary synthetic sequence. Preferably, this is facilitated using naturally occurring or synthesized 
restriction sites. Such sites enhance unidirectional cloning, ligation, etc. 

It is understood that any nucleic acid and any reaction conditions that do not require
exactly this sort of overlap and/or priming (e.g., RNA, RNA polymerases) can be used to create
a modified nucleic acid of the invention without departing from the scope of the invention, and
that other means of synthesizing the desired gene of interest are possible using methods known
in the art. It is further understood that the gene or nucleic acid can be synthesized in one or
several pieces. Likewise, many vectors and host species and strains other than those used herein
can be used successfully in the practice of the invention.

The invention is described more fully in the following Examples, which are presented for
illustrative purposes only and are not intended to limit the scope of the invention. In the
embodiment of the invention disclosed by the Examples, a synthetic gene was designed which
encodes the enzyme methionine gamma-lyase. Methods and vectors for its cloning and
expression are provided, although other methods/vectors can be used.

EXAMPLE 1 - Design of a synthetic gene sequence 
In these Examples, a specific synthetic gene sequence is disclosed encoding naturally
occurring P. putida methionine gamma-lyase gene sequence, and consists of codons common to
enteric bacteria such as E. coli. Also described are three gene fragments derived from the
complete synthetic methionine gamma-lyase gene that have unique cloning sites at each end of
each fragment.

Materials:
Taq DNA polymerase and T4 DNA ligase were purchased from Roche (Branchburg, NJ).
Restriction endonucleases were purchased from New England Biolabs. Any suitable expression
vector, such as pET15b expression vector and E. coli BL21(DE3), available from Novagen
(Madison, WI), may be used to express the synthetic sequences. pBAD expression vector and
E. coli LMG194 were purchased from Invitrogen (Carlsbad, CA). pGEM-3Z, pGEM-5Zf(+)
cloning vectors and E. coli JM109 were purchased from Promega (Madison, WI). The
oligonucleotides for PCR amplification were synthesized by IDT Inc. (Coralville, IA). QIAquick
gel extraction kit and QIAprep spin miniprep kit were purchased from QIAGEN, Inc. (Valencia,
CA).

Equipment: 

Thermocycler Perkin Elmer model 9600 (1991). 
Centrifuge 
Water bath incubator
Culture incubator 
Electrophoresis devices 



Software:

mfold - Prediction of RNA secondary structure by free energy minimization; Versions 2.0 and
    3.0: suboptimal folding with temperature dependence. Michael Zuker and John Jaeger;
    Macintosh version developed by Don Gilbert
DNA Strider 1.01 - a C program for DNA and protein sequence analysis designed and written
    by Christian Marck, Service de Biochimie - Département de Biologie, Institut de Recherche
    Fondamentale, Commissariat à l'Energie Atomique, France
HyperPCR - a HyperCard v. 2.0 stack to determine the optimal annealing temperature for PCR
    reaction and complementarity between the 3' ends of the two oligos and for internal
    complementarity of each 3' end. Developed by Brian Osborne, Plant Gene Expression
    Center, 800 Buchanan St., Albany, CA 94710
Amplify 1.2 - for analyzing PCR experiments. Bill Engels 1992, University of Wisconsin,
    Genetics, Madison, WI 53706, WREngels@macc.wisc.edu
Lasergene 99 - a complete DNA sequence analysis system. DNASTAR, Inc., 1228 South Park
    St., Madison, WI 53715.

Design of synthetic DNA sequence encoding Pseudomonas putida methionine 
gamma-lyase. 

The DNA sequence of the naturally occurring mdeA gene was obtained from Entrez
nucleotide Query (NID g2217943) (SEQ ID NO. 1). Based on this DNA sequence and the amino
acid sequence deduced from its open reading frame, several of the original codons were changed
to codons that are more commonly used in enteric bacteria. The resulting designed sequence is
shown in FIG. 1A (SEQ ID NO. 2). After changing codons to those more commonly used in E.
coli, the computer program mfold was run to calculate the predicted ΔGfolding of the sequence. The
computer program was then used to generate an image of the predicted oligonucleotide, and
regions of predicted secondary structure were identified. Codons in regions of high secondary
structure were changed to the second most commonly used codon for that amino acid in E. coli,
and the predicted ΔGfolding of the sequence was recalculated.

In addition, the sequence was modified to incorporate a non-naturally occurring glycine 
at amino acid position 2. The synthetic sequence therefore does not encode a protein identical 
to the naturally occurring polypeptide encoded by the P. putida methionine gamma-lyase gene. 
The modification of the sequence was incorporated to facilitate unidirectional cloning of the 
synthetic sequence into the cloning and expression vectors using an Nco I restriction site. The
modified DNA sequence was termed synmdeA (SEQ ID NO. 2). In this Example, approximately
fifty percent of the codons were changed from those found in the naturally-occurring gene.

EXAMPLE 2 - Amplification of the synthetic DNA fragments mdeA1, mdeA2, mdeA3
Oligonucleotide Design:

Oligonucleotide primers were synthesized on the basis of the nucleic acid sequence of the
synmdeA gene, whose sequence was determined from the process described in Example 1. The
synmdeA gene, with 1200 bps of coding sequence (1207 bps with residual bases from restriction
sites included) (SEQ ID NO. 3), was broken down into three fragments, mdeA1, mdeA2, and
mdeA3. The first cloning fragment, mdeA1, contained an Nco I cloning site at the 5' end and a Pst
I cloning site at the 3' end, and was 426 bps after the double stranded product was digested (SEQ
ID NO. 4), 441 bps after second round amplification but before digestion (FIG. 1B; SEQ ID NO.
5). The second cloning fragment, mdeA2, contained a Pst I cloning site at the 5' end and an
EcoR I cloning site at the 3' end, and was 410 bps after digestion (SEQ ID NO. 6), 430 bps after
second round amplification but before digestion (FIG. 1C; SEQ ID NO. 7). The third one,
mdeA3, contained an EcoR I cloning site at the 5' end and a BamH I cloning site at the 3' end,
and was 366 bps after digestion (SEQ ID NO. 8), 383 bps after second round amplification but
before digestion (FIG. 1D; SEQ ID NO. 9). The segments were the product of internal restriction
sites occurring in the synmdeA sequence. Restriction sites were chosen that roughly divided the
sequence into three equal segments, and which correspond to common multiple cloning sites on
commercially available vectors.

To synthesize the segments, or fragments, four long oligonucleotides (98-117 bps), and
two short oligonucleotides (~30 bps) were designed for each fragment, and with the help of
computer software, their self-folding secondary structures were minimized as much as possible
in order to maximize the DNA synthesis during PCR reactions. All the oligonucleotides had
secondary structure ΔG's less negative than the ΔG's of the two overlapping annealed fragments,
decreasing the probability of secondary structure forming instead of oligonucleotide
hybridization.

Two short oligonucleotides and four long oligonucleotides were designed for each of the
three segments. They were designed to have 17 to 18 bps overlap with each other. Underlined
nucleotides indicate the annealing regions between two adjacent oligonucleotides.
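
The length of the annealing region between a sense oligonucleotide and the adjacent antisense oligonucleotide can be checked by comparing the 3' end of one with the reverse complement of the other. The Python sketch below is offered only as an illustration of that check; the function names are hypothetical, and the two input strings are the 3'-terminal 21 bases of mdePr1-2 and mdePr1-3 as listed in this Example, written without spaces.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def three_prime_overlap(sense: str, antisense: str) -> int:
    """Length of the longest 3' end of `sense` that anneals to the 3' end of
    `antisense` (both written 5' to 3'); 0 if the ends do not anneal."""
    sense = sense.upper()
    target = reverse_complement(antisense)
    for k in range(min(len(sense), len(target)), 0, -1):
        if sense[-k:] == target[:k]:
            return k
    return 0


# 3'-terminal 21 bases of mdePr1-2 (sense) and mdePr1-3 (antisense)
SENSE_3P = "TACCAGACTGCTACTTTCACC"
ANTISENSE_3P = "GAAGGTGAAAGTAGCAGTCTG"

if __name__ == "__main__":
    print(three_prime_overlap(SENSE_3P, ANTISENSE_3P))
```

For this pair the check returns an overlap of 18 bases, consistent with the 17 to 18 bps overlap stated above.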

1. First segment of synmdeA: mdeA1
The sequences of these oligonucleotides were as follows:

mdePr1-1 (33 bps): 5' CAAGAGGCC ATG GGT CAC GGC TCC AAC AAA CTG 3' (sense)
(SEQ ID NO. 10)

mdePr1-2 (114 bps): 5' CAC GGC TCC AAC AAA CTG CCG GGC TTT GCT ACC CGC
GCT ATC CAC CAC GGT TAT GAC CCG CAG GAT CAC GGT GGT GCA CTG
GTT CCG CCG GTT TAC CAG ACT GCT ACT TTC ACC 3' (sense) (SEQ ID NO.
11)

mdePr1-3 (116 bps): 5' GC TTC CAG CAG GTT CAG GGT CGG GTT GGA GAT ACG
GGA GTA GAA GTG ACC AGC CTG TTC GCC AGC AAA GCA CGC AGC GCC
GTA TTC AAC GGT CGG GAA GGT GAA AGT AGC AGT CTG 3' (antisense) (SEQ
ID NO. 12)

mdePr1-4 (117 bps): 5' CTG AAC CTG CTG GAA GCA CGT ATG GCA TCT CTG GAA
GGC GGC GAA GCT GGT CTG GCG CTG GCA TCT GGT ATG GGC GCG ATC
ACC TCT ACC CTG TGG ACC CTG CTG CGT CCG GGT GAC 3' (sense) (SEQ
ID NO. 13)

mdePr1-5 (116 bps): 5' GC CAT ATC TAC GTG ACG CAG TTT AAC GCC GAA TTC ACC
GAT ACC GTG GTG CAG GAA AGC AAA AGT ACA ACC ATA CAG GGT GTT
GCC CAG CAG AAC TTC GTC ACC CGG ACG CAG CAG 3' (antisense) (SEQ ID
NO. 14)

mdePr1-6 (33 bps): 5' CAG TGC CTG CAG GTC A GC CAT ATC TAC GTG ACG 3'
(antisense) (SEQ ID NO. 15)

2. Second segment, mdeA2

The sequences of these oligonucleotides were as follows:

mdePr2-1 (33 bps): 5' GCT GAC CTG CAG GCA CTG GAA GCG GCT ATG ACC 3'
(sense) (SEQ ID NO. 16)

mdePr2-2 (114 bps): 5' CTG GAG GCT GCT ATG ACC CCG GCT ACC CGT GTT ATC
TAG TTC GAA TCC CCG GCT AAC CCG AAC ATG CAC ATG GCT GAC ATC
GCA GGT GTT GCT AAA ATC GCT CGT AAG CAC GGC 3' (sense) (SEQ ID NO.
17)

mdePr2-3 (115 bps): 5' G GTA TTT AGT AGC GGA GTG AAC AAC CAG GTC AGC GCC
CAG TTC CAG CGG ACG TTG CAG GTA CGG AGT AC A GTA GGT GTT ATC
AAC AAC TAC GGT AGC GCC GTG CTT ACG AGC GAT 3' (antisense) (SEQ ID
NO. 18)

mdePr2-4 (111 bps): 5' CAC TCC GCT ACT AAA TAC C TG TCC GGC CAC GGC GAC
ATC ACT GCT GGC ATC GTA GTA GGC TCC CAG GCA CTG GTT GAC CGT
ATC CGT CTG CAA GGT CTG AAA GAC ATG ACC 3' (sense) (SEQ ID NO. 19)

mdePr2-5 (115 bps): 5' G TAC CTG AGC GTT AGC AC A GTG ACG GTC CAT ACG CAG GTT
CAG GGT CTT GAT ACC ACG CAT CAG CAG TGC TGC GTC GTG CGG GGA
CAG AAC AGC GCC GGT CAT GTC TTT CAG ACC 3' (antisense) (SEQ ID NO. 20)

mdePr2-6 (33 bps): 5' C CAG GAA TTC AGC CA G TAC CTG AGC GTT AGC AC 3' (antisense)
(SEQ ID NO. 21)

3. Third segment, mdeA3

The sequences of these oligonucleotides were as follows:

mdePr3-1 (31 bps): 5' T CTT AAT GAA TTC CTG GCT CGT CAG CCG CAG 3' (sense)
(SEQ ID NO. 22)

mdePr3-2 (105 bps): 5' CTG GCT CGT CAG CCG CAG GTA GAA CTG ATC CAC TAT
CCG GGC CTG GCT TCC TTC CCG CAG TAC ACT CTG GCA CGT CAG CAG
ATG TCC CAG CCG GGC GGT ATG ATC 3' (sense) (SEQ ID NO. 23)

mdePr3-3 (106 bps): 5' C GTC ACC CAG GGA AAC CGC ACG GGA GAA CAG CTG CAG
AGC GTT CAT GAA ACG ACG ACC AGC GCC GAT GCC ACC CTT CAG TTC
GAA AGC GAT CAT GCC ACC CGG CTG 3' (antisense) (SEQ ID NO. 24)

mdePr3-4 (106 bps): 5' GCG GTT TCC CTG GGT GAC G CT GAA TCC CTG GCG CAG
CAC CCG GCA TCC ATG ACT CAC TCC TCC TAC ACT CCG GAA GAA CGT
GCG CAC TAC GGC ATC TCC GAA GGC C 3' (sense) (SEQ ID NO. 25)

mdePr3-5 (98 bps): 5' CA AGC GCT AGC CTT CAG AGC CTG CTG AAC GTC TGC CAG
CAG ATC ATC GAT GTC TTC CAG ACC AAC AGA CAG ACG AAC CA G GCC
TTC GGA GAT GCC GTA 3' (antisense) (SEQ ID NO. 26)

mdePr3-6 (32 bps): 5' T GGT GGA TCC T CA AGC GCT AGC CTT CAG AGC C 3'
(antisense) (SEQ ID NO. 27)

Amplification of segmental DNA: mdeAl, mdeA2, mdeA3: 

Each segment synthesis took two rounds of amplification. The first round was to generate
the template for the second round using the four long oligonucleotides with overlapping ends
(e.g., 3' or 5' sense ends overlapping neighboring 5' or 3' antisense ends). The second round of
amplification used the two short oligonucleotides and the template from the first. A standard PCR
reaction mixture was used with 100 µl reaction volume, 0.2 mM dNTPs (final concentration), and
60 to 90 pmoles of each oligonucleotide.

To synthesize the template for mdeA1, termed tpA1, mdePr1-2 (71 pmoles), mdePr1-3
(74 pmoles), mdePr1-4 (77 pmoles), and mdePr1-5 (64 pmoles) were used. MdePr2-2 (64
pmoles), mdePr2-3 (73 pmoles), mdePr2-4 (67 pmoles), and mdePr2-5 (74 pmoles) were used 
to synthesize mdeA2 template, termed tpA2. To synthesize mdeA3 template, termed tpA3, 
mdePr3-2 (66 pmoles), mdePr3-3 (62.6 pmoles), mdePr3-4 (60 pmoles), and mdePr3-5 (82 
pmoles) were used. The strategy is shown in FIG. 2A. Based on the estimated annealing 
temperatures between the oligonucleotides above, the PCR reaction conditions were as follows: 



-24- 



40608/MAH/B583 



first denaturation at 94°C for 2 min; then 10 cycles of denaturation at 94°C for 30 sec; annealing 
at 51 °C for 40 sec, and extension at 72°C for 1 min. This was followed by 20 cycles of 
denaturation at 94°C for 30 sec; 65X for 55 sec; 72X for 1 min; then a final extension at 72 ""C 
for 7 min. The PGR was carried out using a Perkin-Elmer Gene Amp 9600. 
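
The first-round cycling conditions can be written down as a small data structure, which makes the two-phase annealing scheme explicit. The sketch below is illustrative only: the dictionary layout and function name are assumptions, all temperatures are taken as degrees Celsius, and the computed total counts only the programmed hold times, not the ramp times of the thermal cycler.

```python
# First-round program; each step is (temperature in deg C, hold time in seconds).
# The layout and names are illustrative assumptions.
FIRST_ROUND = [
    {"cycles": 1, "steps": [(94, 120)]},                       # initial denaturation
    {"cycles": 10, "steps": [(94, 30), (51, 40), (72, 60)]},   # low-stringency cycles
    {"cycles": 20, "steps": [(94, 30), (65, 55), (72, 60)]},   # higher-temperature cycles
    {"cycles": 1, "steps": [(72, 420)]},                       # final extension
]


def total_hold_seconds(program):
    """Sum the programmed hold times over all stages and cycles."""
    return sum(
        stage["cycles"] * sum(seconds for _, seconds in stage["steps"])
        for stage in program
    )


seconds = total_hold_seconds(FIRST_ROUND)
print(f"{seconds} s of programmed holds ({seconds / 60:.0f} min)")
```

Under these assumptions the programmed holds add up to 4740 seconds, or 79 minutes, per first-round reaction.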

The PCR products were separated on 2% agarose gels run with a 1 kb DNA ladder
(NEB); product bands of the expected size (411 bps for tpA1, 401 bps for tpA2, and 360 bps for
tpA3) were cut out and extracted using a QIAquick gel extraction kit. The products were then used
as the templates for second round PCR reactions to synthesize mdeA1, mdeA2, and mdeA3
DNAs. The strategy for the second round amplification is shown in FIG. 2B.

For the second round, mdePr1-1 (80 pmoles), mdePr1-6 (67 pmoles), and 1 µl of the 50 µl of
gel-purified template tpA1 (above) were used to amplify the mdeA1 segment, again with the 3' end
of mdePr1-1 and mdePr1-6 overlapping the 5' end of the template, and each 3' end (of
oligonucleotide or template) priming the extension of the full length segment product. Similarly,
mdePr2-1 (86 pmoles), mdePr2-6 (86 pmoles), and 1 µl of template tpA2, and mdePr3-1 (74 pmoles),
mdePr3-6 (84 pmoles), and 1 µl of tpA3, were used to amplify the mdeA2 and mdeA3 segments,
respectively. The PCR reaction conditions were as follows: first denaturation at 94°C for 2 min;
then 25 cycles of denaturation at 94°C for 30 sec, annealing at 51°C for 40 sec, and extension
at 72°C for 30 sec; followed by a final extension at 72°C for 7 min.

The PCR-amplified products were identified by size on the 2% agarose gel: a 441 bp
band for mdeA1, a 430 bp band for mdeA2, and a 383 bp band for mdeA3. The DNAs from the
bands were extracted using a QIAquick gel extraction kit.

EXAMPLE 3 - Cloning the synthetic DNA fragments mdeA1, mdeA2, and mdeA3 into an
appropriate vector

The vector pGEM-5Z (Promega, 3003 bps) and the purified PCR mdeA1 DNA were
double cut with Nco I and Pst I; pGEM-3Z (Promega, 2743 bps) and the purified PCR mdeA2
DNA were double cut with Pst I and EcoR I restriction enzymes; pGEM-3Z and purified PCR
mdeA3 DNA were double cut with EcoR I and BamH I restriction enzymes. These vectors carry
the multiple cloning site arrangement from pUC18, and are ampicillin resistant. All restriction
digestion reactions were incubated overnight at 37°C. The digested products were then purified
by gel electrophoresis on a 2% agarose gel followed by extraction of the DNA using a QIAquick
gel extraction kit.






The purified, double cut pGEM-5Z and mdeA1 were ligated with T4 DNA ligase and
buffers (NEB) and incubated overnight at 16°C. Similarly, the double cut pGEM-3Z and mdeA2,
and double cut pGEM-3Z and mdeA3, were ligated with T4 DNA ligase, but they were incubated
at 12°C because the EcoR I site requires a lower temperature to anneal. Several reactions were
carried out for each construct to optimize the molar ratio between vector and insert (e.g., 1:1,
1:3, and 3:1 vector:insert ratios). FIG. 3 illustrates the multiple cloning site and ligation of
inserts into the vectors.
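
As an illustration of how the vector:insert molar ratios translate into DNA masses, the following Python sketch applies the usual proportionality between mass and length for double-stranded DNA. The 50 ng vector mass is a hypothetical value chosen for the example, the function name is an assumption, and the undigested vector length is used as an approximation of the cut vector.

```python
def insert_mass_ng(vector_ng, vector_bp, insert_bp, vector_parts, insert_parts):
    """Insert mass (ng) giving a vector:insert molar ratio of vector_parts:insert_parts.

    For double-stranded DNA, moles are proportional to mass divided by length,
    so the insert mass scales with the length ratio and the desired molar ratio.
    """
    return vector_ng * (insert_bp / vector_bp) * (insert_parts / vector_parts)


VECTOR_NG = 50.0   # hypothetical amount of cut pGEM-5Z per ligation
VECTOR_BP = 3003   # pGEM-5Z length given in Example 3
INSERT_BP = 426    # digested mdeA1 fragment length given in Example 3

for vector_parts, insert_parts in [(1, 1), (1, 3), (3, 1)]:
    mass = insert_mass_ng(VECTOR_NG, VECTOR_BP, INSERT_BP, vector_parts, insert_parts)
    print(f"{vector_parts}:{insert_parts} vector:insert -> {mass:.2f} ng mdeA1")
```

With these assumed inputs, the three ratios correspond to roughly 7.09 ng, 21.28 ng, and 2.36 ng of insert, respectively.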

E. coli JM109 competent cells (Promega or Bio 101) were transformed with the ligation
reactions described above using a standard heat shock transformation procedure (Sambrook et
al., 1989, supra). To select for colonies containing mdeA1, mdeA2, and mdeA3 clones, the cells
were grown on LB+Ampicillin (50 µg/µl) plates.

Transformant colonies were first tested by PCR screening using mdePr1-1 and mdePr1-6,
mdePr2-1 and mdePr2-6, or mdePr3-1 and mdePr3-6 as the primers for mdeA1, mdeA2, and mdeA3
clones, respectively. The PCR reaction volume was 25 µl with 0.2 mM dNTPs and 20 pmoles of
each primer. The templates were picked directly from the colonies, and the conditions were as
follows: first denaturation at 94°C for 4 min; then 25 cycles of denaturation at 94°C for 30 s,
annealing at 57°C for 40 s, and extension at 72°C for 30 s; then a final extension at 72°C for 7
min. The positive colonies containing mdeA1, mdeA2, or mdeA3 clones were identified by the
presence of 441 bp, 430 bp, or 383 bp bands, respectively.

To further confirm that the colony actually carried the mdeA1, mdeA2, or mdeA3
construct, restriction mapping of its plasmid was done by cutting the plasmid with Nco I + Pst
I, Pst I + EcoR I, or EcoR I + BamH I. The presence of a 426 bp band (mdeA1), a 414 bp band
(mdeA2), or a 367 bp band (mdeA3) would be expected on a 2% agarose gel if the plasmid carries
the proper insert.

EXAMPLE 4 - Sequencing of the synthetic mdeA1, mdeA2, and mdeA3 DNA fragments

After isolating plasmids containing the mdeA1, mdeA2, and mdeA3 inserts, the clones were
submitted to the UCLA sequencing facility (Los Angeles, CA) for sequencing. M13 forward and
reverse primers were used. Clones that carried the correct DNA sequence of mdeA1, mdeA2, and
mdeA3 were selected and named pSmA1-17, pSmA2-8, and pSmA3-3.

EXAMPLE 5 - Construction of full-length synmdeA encoding methionine gamma-lyase

The colonies containing pSmA1-17, pSmA2-8, and pSmA3-3 were cultured with
LB+ampicillin (50 µg/µl) overnight at 37°C. Plasmids were extracted using a QIAprep spin
miniprep kit (QIAGEN, Inc., Valencia, CA). The plasmids pSmA1-17, pSmA2-8, and pSmA3-3
were double cut overnight at 37°C with Nco I/Pst I, Pst I/EcoR I, and EcoR I/BamH I restriction
enzymes, respectively. A pET15b vector (Novagen) was cut with Nco I/BamH I restriction
enzymes, and a pBAD/His C vector (Invitrogen) was cut with Nco I/Bgl II. The double cut DNAs
were separated on a 2% agarose gel, and the bands corresponding to mdeA1 (426 bps), mdeA2 (414
bps), mdeA3 (367 bps), pET15b (5 kbps), and pBAD/His C (4 kbps) were isolated and purified
using a QIAquick gel extraction kit.

Purified mdeA1, mdeA2, and mdeA3 DNAs were then ligated into double cut pET15b at the
Nco I and BamH I cloning sites, or into pBAD/His C at the Nco I and Bgl II cloning sites, using
T4 DNA ligase overnight at 12°C.

The resulting plasmids were transformed into E. coli JM109 competent cells using a
standard heat shock transformation procedure (Sambrook et al., 1989, supra). To select the
positive clones containing synmdeA, the cells were grown on LB+Ampicillin (50 µg/µl) plates
overnight at 37°C.

The transformant colonies were first checked with the PCR screening method described
above, using mdePr1-1 and mdePr3-6 as the primers. A 1200 bp band was expected on
the agarose gel if the colony contained synmdeA clones. Selected pET15b and pBAD/His C
vectors carrying the synmdeA insert were named pTM-1 and pBM-1 overexpression plasmids,
respectively. The PCR-positive colonies were then further confirmed by restriction
mapping, with Nco I and BamH I restriction enzymes used on pTM-1, and Nco I and
Hind III restriction enzymes used on pBM-1. Again, 1200 bp bands were seen on 2% agarose
gels.

Plasmids pTM-1 and pBM-1 were transferred to the expression hosts E. coli BL21(DE3) and
LMG194 by plasmid extraction followed by transformation.

EXAMPLE 6 - Over-expression of synthetic L-Methionine-alpha-gamma-lyase gene

Host E. coli strains carrying pTM-1 and pBM-1, referred to as BL/pTM01 and
LMG/pBM01 respectively, were grown on an LB+ampicillin plate and an RMG+ampicillin plate,
respectively. A single colony from each plate was then picked and cultured overnight in
LB+ampicillin liquid medium. Then 5 ml of LB+ampicillin was inoculated with 100 µl of each
overnight culture, and each was incubated for 2 hours at 37°C with shaking or until O.D.600 (nm)
reached 0.8 - 0.9. Initially, 1 ml of each culture was removed as a non-induced control.
The BL/pTM01 culture was then induced to express protein by adding IPTG to a final concentration
of 2 mM, and the LMG/pBM01 culture was induced with a final concentration of 0.02% L-arabinose.
Incubation was continued at 37°C for 3 hours. Samples of 1 ml were collected every hour. All
samples were centrifuged at 12,000 x g for 3 minutes. The cells were then lysed by resuspension
in 1x NuPAGE sample buffer (Novex) containing 50 mM DTT, and incubation at 97°C for 3
minutes. After centrifugation for 10 min at 12,000 x g, the supernatants were separated along with
protein size markers by SDS-PAGE on a 4%-20% gradient polyacrylamide gel (NuPAGE MES SDS,
Novex) for 1 hour at 150 volts. The gels were stained with Coomassie blue for 2 hours and
destained in 10% acetic acid, 20% methanol solution, followed by destaining in 7% acetic acid,
5% methanol. 43 kD bands corresponding to a molecular weight marker were seen on the
destained gels (FIG. 4). These bands corresponded to the major protein in the induced samples.
As seen in FIG. 4, expression of synmdeA was vastly superior to expression of the native enzyme,
seen in FIG. 5. The native enzyme expressed poorly in E. coli, and its product corresponded to
only a truncated portion of the complete gene. Attempted expression of the native gene gave a
protein of apparent molecular weight approximately 28 kD, indicating that a substantial part of
the enzyme was missing. The protein showed no methionine gamma-lyase activity. Without wishing
to be bound to any particular mechanism, it is hypothesized that the truncation was caused by an
interruption in translation at a rare codon. This speculation is supported by the fact that an
interruption at this point would result in a polypeptide product having a molecular weight of
approximately 28 kD.

EXAMPLE 7 - Comparison of native mdeA and synmdeA gene expression

To demonstrate the usefulness of the synthetic gene for the expression of difficult-to-express
genes in E. coli, the synmdeA gene was expressed in E. coli using the vector pET15b.
This gene encodes a methionine γ-lyase enzyme, but contains an additional amino acid relative
to the native protein described by Soda and co-workers (e.g., US Patent No. 5,863,788). The
results are shown in the gel in FIG. 4A. Based on the density of the band corresponding to the
methionine gamma-lyase enzyme of approximate molecular weight 40,000, we estimate the level
of expression to be 10% or more of the total protein in the crude cell lysate of the E. coli host.
By contrast, expression of the native mdeA gene in the vector pSIT is substantially less under the
same induction conditions (FIG. 4B). In the experiment shown in FIG. 4B, all samples were
incubated at 37°C. The induced samples contain extra bands of about 28 kD, which indicate that
premature termination of the enzyme occurred during translation of the native gene. Both the
native and synthetic gene vectors are under the control of T7 RNA polymerase promoters.
To put these results into another context, the expression reported by Soda and coworkers
in US Patents 5,861,154 and 5,863,788 is 0.82 units/mg. Using the specific activity
of the purified enzyme of 20.4 units/mg reported by Soda in Anal. Biochem. 138, 421-424
(1984), the expression level is estimated to be no more than 4% of the total protein in the E. coli
host. This estimate is an upper limit on the expression reported by Soda because the reported
activity involves some partial purification of the enzyme prior to assay.
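
The arithmetic behind this estimate is the ratio of the activity of the crude preparation to the specific activity of the pure enzyme. A minimal Python sketch follows; the function name is an illustrative assumption, and the calculation assumes that all measured activity comes from the enzyme of interest.

```python
def fraction_of_total_protein(crude_units_per_mg, pure_units_per_mg):
    """Estimate the enzyme's share of total protein from two specific activities."""
    return crude_units_per_mg / pure_units_per_mg


# Values quoted in the text: 0.82 units/mg (crude) and 20.4 units/mg (purified).
fraction = fraction_of_total_protein(0.82, 20.4)
print(f"Estimated share of total protein: {100 * fraction:.1f}%")
```

The ratio 0.82 / 20.4 is about 0.040, which is the source of the "no more than 4%" figure.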

EXAMPLE 8 - Comparison of expression of genes with different ΔGfolding

FIG. 5 is a gel showing expression of two genes with different ΔGfolding. Naphthalene
dioxygenase from P. putida has a ΔGfolding of -256.1 kcal/mol. A gene with this very low free
energy would not be expected, under the principles of the invention, to express well. In fact,
as seen in lanes 1-4 of FIG. 5, it does not. By contrast, another gene, methionine gamma lyase
(mgl1) from T. vaginalis, has a ΔGfolding of -152.5 kcal/mol. As can be seen from lanes 6-9 of
FIG. 5, this protein can be induced and expresses well under the conditions used. Both genes
were cloned into the pBAD vector and grown at 37°C.

EXAMPLE 9 - Synthesis of improved eukaryotic genes and their expression in prokaryotic hosts
Oxidoreductases 

The enzyme family of oxidoreductases is large and complex, and many members function
to stereoselectively oxidize and reduce functional groups such as C=O, C=C, and C=N. In the
pharmaceutical and agricultural industries, for example, these enzymes are used to prepare drugs
and chemicals requiring, e.g., chiral compounds. For example, they can be used to
stereoselectively reduce ketones to produce chiral alcohols consisting predominantly of a single
stereoisomer. In this Example, the methods of the invention were used to create highly
expressible oxidoreductases. Properties of exemplary original oxidoreductase genes and their
synthetic analogs are discussed and shown in Table 3 below, and the superiority of the synthetic
sequences in ΔG, expression, and enzyme activity can be seen.

Keto Reductases 

These enzymes reduce keto esters, aldehydes, and other ketones into equivalent alcohol 
products. 

NADPH-Dependent Aldehyde Reductase 1, ALR1: The native gene encoding an NADPH-dependent
aldehyde reductase (ALR) is from a red yeast, Sporobolomyces salmonicolor (also
known as Sporidiobolus salmonicolor); the enzyme catalyzes the reduction of a variety of
carbonyl compounds. The gene is 969 bp (SEQ ID NO. 31) and encodes a polypeptide of 35,232 Da.
The deduced amino acid sequence (SEQ ID NO. 32) shows a high degree of similarity to other
members of the aldo-keto reductase superfamily. The synthetic aldehyde reductase 1 gene
(synALR1; SEQ ID NO. 33) was created using the known protein sequence.

Aldehyde Reductase 2, ALR2: This gene encodes an NADPH-dependent aldehyde
reductase (AR2) from Sporobolomyces salmonicolor AKU4429 that reduces ethyl
4-chloro-3-oxobutanoate (4-COBE) to ethyl (S)-4-chloro-3-hydroxybutanoate (Kita et al., Appl
Environ Microbiol 1999 Dec; 65(12):5207-11). The ALR2 gene (SEQ ID NO. 34) is 1,032 bp long and
encodes a 37,315-Da polypeptide. The deduced amino acid sequence (SEQ ID NO. 35) exhibits
significant levels of similarity to the amino acid sequences of members of the mammalian
3-beta-hydroxysteroid dehydrogenase-plant dihydroflavonol 4-reductase superfamily but not to the
amino acid sequences of members of the aldo-keto reductase superfamily or to the amino acid
sequence of an aldehyde reductase previously isolated from the same organism (K. Kita et al.,
Appl Environ Microbiol 62:2303-2310, 1996; SEQ ID NO. 32). The synthetic version of
ALR2, or synALR2mut (SEQ ID NO. 36), contains a mutation at position 25 of the amino acid
sequence (SEQ ID NO. 37), replacing alanine with glycine, which allows
the enzyme to use either NADH or NADPH as a cofactor.

Reductase 1 from yeast, YPR1: This enzyme is a good general ketone reductase. The
"native" sequence, related to Accession No. X80642 (Miosga et al.), was cloned into pBAD with
a GGT insertion after the initiating ATG (SEQ ID NO. 38). This addition, made to add a
restriction site for ease of cloning, resulted in a glycine at position 2 in the amino acid
sequence of both the "native" and the synthetic YPR1 peptide (SEQ ID NO. 39). SEQ ID NO. 40
is the synthetic sequence, having a 15.1% improvement in ΔGfolding.

Yeast GCY1: SEQ ID NO. 41 is a nuclear gene for a yeast protein showing unexpectedly
high homology with mammalian aldo/keto reductases as well as with ρ-crystallin, one of the
prominent proteins of the frog eye lens. The coding region is 939 bases and encodes a protein
of 312 amino acids (SEQ ID NO. 42; estimated MW 35,000). A synthetic analog was made,
synGCY1 (SEQ ID NO. 43), having a GGC insertion after ATG (to facilitate cloning into the
pBAD vector), which results in the insertion of a glycine after the initiating methionine in the
synthetic peptide sequence (SEQ ID NO. 44).

Reductase Gre2 from yeast: This gene and related protein product were originally
sequenced as part of the yeast genome (Goffeau et al., Accession Nos. NC_001147 and
NP_014490). The native gene (SEQ ID NO. 45) was not cloned, and its protein sequence (SEQ
ID NO. 46) is based on the best open reading frame. However, the synthetic gene synGRE2
(SEQ ID NO. 47), derived from the wild-type sequence, was modified by addition of a GGC
insertion (to add a restriction site), cloned, and expressed as a protein (SEQ ID NO. 48). The
protein's reductase function has been confirmed.

Yeast Aldo-Keto Reductase Gre3: This gene encodes a keto-aldose
reductase (Goffeau et al., Accession Nos. NC_001140 and NP_011972). The "native" sequence
(SEQ ID NO. 49) has been modified to insert an ATT at the second codon position (inserting
isoleucine in SEQ ID NO. 50) to add a restriction site for cloning. The "native" Gre3 protein
exhibits reductase activity at 30°C and 37°C as shown in Table 3 below.

CMKR (S1): The product of this gene (SEQ ID NO. 69) is an NADPH-dependent
carbonyl reductase (S1) from Candida magnoliae, which catalyzes the reduction of ethyl
4-chloro-3-oxobutanoate (COBE) to ethyl (S)-4-chloro-3-hydroxybutanoate (CHBE), with a 100%
enantiomeric excess. CHBE is a useful chiral building block for the synthesis of pharmaceuticals.
The S1 gene is 849 bp and encodes a polypeptide of 30,420 Da. The deduced amino acid
sequence (SEQ ID NO. 70) has a high degree of similarity to those of other members of the
short-chain alcohol dehydrogenase superfamily.

Table 3: Properties of Native and Synthetic Genes



Gene name     Length  Molecular    ΔG           ΔG/base           % ΔG difference  Activity at  Activity at
              (bps)   Weight (kD)  (kcal/mole)  (kcal/mole*base)  between native   30°C (u/ml)  37°C (u/ml)
                                                                  and synthetic

nativeALR1    972     35.2         -152.5       -0.157            100              ND           ND
synALR1       972     35.2         -85.8        -0.0883           56.3             75.5         9.25
nativeALR2    1032    37.3         -162.2       -0.1572           100              ND           ND
synALR2mut    1032    37.3         -101.2       -0.0981           62.4             4.13         7.0
nativeYPR1    942     34.8         -89.4        -0.0949           100              4.15         6.23
synYPR1       942     34.8         -75.9        -0.0806           84.9             11.791       16.609
nativeGCY1    939     35.1         -76.6        -0.0816           100              0.105        0.533
synGCY1       942     35.1         -73.2        -0.0777           95.2             4.00         4.53
nativeGRE2    1029    38.2         -103.3       -0.1004           100              ND           ND
synGRE2       1032    38.2         -71.6        -0.0694           69.1             ND           ND
nativeGRE3    987     37.2         -89          -0.0902           100              0.35         0.52
synGRE3       987     37.2         -65.5        -0.0664           73.6             1.2          1.1
native CMKR   852     30.6         -145.4       -0.1706           100              ND           ND
synCMKR       852     30.6         -70.5        -0.0827           48.5             ND           239.16
pKDDC         1461    54.0         -244.4       -0.1673           100              ND           ND
synAAAD       1464    54.0         -133.9       -0.0915           54.7             ND           ND
Fdh1.2        1098    40.6         -76.1        -0.0693           100              0.48         0.54
synFdh        1098    40.6         -98          -0.0893           128.9            2.48         0.19

ND = not determined.
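
The derived columns of Table 3 can be reproduced from the length and ΔG columns. The Python sketch below assumes, based on the tabulated values rather than on any statement in the text, that the "% ΔG difference" column is the per-base ΔG of the synthetic gene expressed as a percentage of the per-base ΔG of its native counterpart. Function names are illustrative, and small discrepancies in the last digit can arise from rounding of intermediate values.

```python
def dg_per_base(dg_kcal_per_mole, length_bp):
    """Folding free energy normalized by sequence length (kcal/mole per base)."""
    return dg_kcal_per_mole / length_bp


def percent_of_native(dg_syn, len_syn, dg_native, len_native):
    """Per-base dG of the synthetic gene as a percentage of the native per-base dG."""
    return 100.0 * dg_per_base(dg_syn, len_syn) / dg_per_base(dg_native, len_native)


# (name, length in bps, dG in kcal/mole) for two native/synthetic pairs from Table 3.
PAIRS = [
    (("nativeALR1", 972, -152.5), ("synALR1", 972, -85.8)),
    (("nativeALR2", 1032, -162.2), ("synALR2mut", 1032, -101.2)),
]

for (n_name, n_len, n_dg), (s_name, s_len, s_dg) in PAIRS:
    print(
        f"{s_name}: dG/base = {dg_per_base(s_dg, s_len):.4f}, "
        f"{percent_of_native(s_dg, s_len, n_dg, n_len):.1f}% of {n_name}"
    )
```

For these two pairs the sketch gives -0.0883 and 56.3% for synALR1, and -0.0981 and 62.4% for synALR2mut, matching the table.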



Other Sequences 

L-Aromatic Amino Acid Decarboxylase from Pig Kidney: L-Aromatic amino acid
decarboxylase (dopa decarboxylase; DDC) is a pyridoxal 5'-phosphate (PLP)-dependent
homodimeric enzyme that catalyzes the decarboxylation of L-dopa and other L-aromatic amino
acids. A cDNA that codes for the protein from pig kidney was cloned by Moore et al., Biochem
J 1996 Apr 1; 315 (Pt 1):249-56. Using this pKDDC sequence (SEQ ID NO. 53; Accession No.
S82290) and its deduced amino acid sequence (SEQ ID NO. 54), a synthetic decarboxylase,
synAAAD, was constructed with a GGT insertion (SEQ ID NO. 55) to insert a glycine in the
amino acid sequence (SEQ ID NO. 56). The synAAAD nucleic acid sequence had a nearly 50%
improvement in ΔG (see Table 3).

Formate Dehydrogenase (Fdh1.2): The formate dehydrogenase (Fdh1.2) DNA (SEQ ID
NO. 57) and protein (SEQ ID NO. 58) sequences are from Candida boidinii (Accession No.
AJ245934). In order to create an Nco I restriction site for cloning into the expression vector
pBAD/HisA, a glycine codon was inserted after the first methionine codon (SEQ ID NO. 59).
The resultant recombinant protein, synFdh (SEQ ID NO. 60), has an inserted glycine after the
initiating methionine as compared to the native protein. Native Fdh1.2 and synFdh otherwise
encode the same protein sequence. The synthetic sequence had 199 out of 366 codons changed
as compared to native Fdh1.2 to optimize expression in E. coli (see Table 4 below). Homology
at the DNA level of Fdh1.2 and synFdh is about 78.5%. Expression of synFdh is 5-fold higher,
based on activity measurements, than that of native Fdh1.2.

The ΔG of Fdh1.2 is -76.1 kcal/mole (-0.069 kcal/mol·base) and the ΔG of synFdh is
-98.0 kcal/mole (-0.089 kcal/mol·base). Because native Fdh1.2 does not have high secondary
structure, it was possible to optimize the sequence for expression according to methods of the
invention without increasing, and in fact, slightly decreasing, the ΔGfolding. FIG. 7 shows
expression data of Fdh1.2 compared with synFdh at 30°C and 37°C. As shown in FIG. 8,
synFdh, induced with 0.2% L-arabinose at 30°C, exhibits higher catalytic activity than does
induced native Fdh1.2 or uninduced Fdh1.2 in the oxidation of formate in the presence of NAD+
(NAD+ + HCO2- -> NADH + CO2). These figures demonstrate the superior expression
characteristics of the synthetic Fdh sequence as compared to the native sequence.



Table 4: Codon Preference of C. boidinii and E. coli for Selected Amino Acids

Amino Acid    C. boidinii Codon    E. coli Codon

R (Arg)       AGA (13/13)          AGA (0); CGT (0.74)
N (Asn)       AAT (14/16)          AAT (0.06), AAC (0.94)
D (Asp)       GAT (22/24)          GAT (0.33), GAC (0.67)
Q (Gln)       CAA (9/9)            CAA (0.14), CAG (0.86)
L (Leu)       TTA (22/32)          TTA (0.02), CTG (0.83)
P (Pro)       CCA (11/14)          CCA (0.15), CCG (0.77)
T (Thr)       ACT (11/22)          ACT (0.35), ACC (0.55)
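
To illustrate how the preferences in Table 4 might be applied, the following Python sketch replaces each C. boidinii codon listed in the table with the E. coli codon listed beside it. It is a simplified, table-driven substitution only: it covers just the seven amino acids shown, does not evaluate ΔGfolding, and should not be taken as the full design method of the invention. The function name and the short input sequence are hypothetical.

```python
# C. boidinii codon -> preferred E. coli codon, taken from Table 4.
ECOLI_PREFERRED = {
    "AGA": "CGT",  # Arg
    "AAT": "AAC",  # Asn
    "GAT": "GAC",  # Asp
    "CAA": "CAG",  # Gln
    "TTA": "CTG",  # Leu
    "CCA": "CCG",  # Pro
    "ACT": "ACC",  # Thr
}


def recode(cds):
    """Swap listed codons for their E. coli counterparts; return (sequence, codons changed)."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    recoded = [ECOLI_PREFERRED.get(codon, codon) for codon in codons]
    changed = sum(old != new for old, new in zip(codons, recoded))
    return "".join(recoded), changed


# Hypothetical toy sequence: Met-Arg-Leu-Gln-stop.
new_seq, n_changed = recode("ATGAGATTACAATAA")
print(new_seq, n_changed)
```

Because every substitution is synonymous, the encoded protein is unchanged; only the codon usage differs.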



Hydantoinase: The hydantoinase gene from Pseudomonas putida (SEQ ID NO. 61) and
its deduced amino acid sequence (SEQ ID NO. 62) (Accession No. AAC00209) were used to
create synthetic hydantoinase gene (SEQ ID NO. 63) and protein (SEQ ID NO. 64) products. This
gene product is useful to make non-natural α-amino acids. To create the synthetic gene and
protein, a glycine was added after the first methionine so that the gene could be subcloned into
the pBAD/HisA expression vector. The nucleic acid sequence is 1491 bp, and in its native form
has a free energy of folding of -287.6 kcal/mole. The synthetic hydantoinase has a ΔG of -155.5
kcal/mole. Homology at the nucleic acid level between the native and synthetic hydantoinase is
78.4%.

Vanillyl Alcohol Oxidase, VaoA: A vanillyl-alcohol oxidase gene (SEQ ID NO. 65) and
its deduced amino acid sequence (SEQ ID NO. 66) from Penicillium simplicissimum were used.
VaoA oxidizes vanillyl alcohol and related aromatic alcohols. To create the synthetic gene (SEQ
ID NO. 67) and protein (SEQ ID NO. 68), a glycine was added after the first methionine so that
the gene could be subcloned into the pBAD/HisA expression vector. The sequence is 1686 bp
long and the native form has a ΔG of -176.8 kcal/mole; the ΔG of synVaoA is -164.6 kcal/mole. The
genes have 77% homology at the nucleic acid level.

Myo-Inositol-1-Phosphate Synthase (Ino1): INO-1 (SEQ. ID NO. 73) cyclizes D-glucose
6-phosphate to myo-inositol 1-phosphate, which is a precursor for coenzyme Q. The native ino-1
gene (SEQ. ID NO. 72) is 1602 bps. The ΔG is -152.2 kcal/mole. The synthetic ino-1 gene,
called synIno-1 (SEQ. ID NO. 74), has a GGT insertion to create the cloning site, which inserts
a glycine residue in the synINO protein (SEQ. ID NO. 75). synIno-1 is 1605 bps and has a ΔG
of -131.8 kcal/mole. The similarity at the DNA level of ino-1 and synIno-1 is 77.4%.

Galactose Oxidase (GAO): The gaoA gene (SEQ. ID. NO. 76), encoding the secreted
copper-containing enzyme galactose oxidase (SEQ. ID. NO. 77), was isolated from the
Deuteromycete fungus Dactylium dendroides (Accession number: M86819; also called
Hypomyces rosellus). ΔG for the native DNA is -244 kcal/mole and ΔG for the synthetic gene
(synGAO, SEQ. ID. NO. 78) is -195.3 kcal/mole. The open reading frame for galactose oxidase
(GAO) is 2046 bp. At the DNA level, synGAO and GAO have 76.6% identity. Galactose oxidase
oxidizes galactose, and can be used in the quantitative determination of galactose level in blood.
The synthetic galactose oxidase protein (SEQ. ID NO. 79) has a glycine inserted in the second
amino acid position.

The Gibbs free energy (ΔG) of all DNA foldings described in this Example was
determined using mfold2 provided by Washington University School of Medicine
(http://mfold2.wustl.edu). The conditions used for calculation of the free energy of DNA folding
were 37°C, Na+ = 1 M, and Mg2+ = 0.

Enzyme activities of the keto reductases were determined photometrically using
ethyl 4-chloroacetoacetate as a substrate. The reaction mixture (1.0 ml) comprised 50 mM
potassium phosphate buffer (pH 6.5), 250 µM NADPH, 5 mM substrate, and cell lysate. The
reaction was measured at room temperature. One unit of the enzyme was defined as the amount
catalyzing the oxidation of 1 µmole NADPH/min. Formate dehydrogenase activity was assayed
by mixing sodium formate with NAD+, and measuring NADH recycling activity on a
spectrophotometer at 340 nm.
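
For readers who wish to convert a photometric reading into units per ml, the following Python sketch applies the Beer-Lambert law. It is an illustration under stated assumptions, not the assay calculation used to generate Table 3: it takes one unit as 1 µmol of NADPH oxidized per minute (the usual convention), uses the commonly cited extinction coefficient for NADPH at 340 nm of 6.22 per mM per cm, assumes a 1 cm light path, and uses hypothetical values for the absorbance change and lysate volume.

```python
NADPH_EXTINCTION_MM = 6.22  # mM^-1 cm^-1 at 340 nm (standard literature value, assumed here)


def units_per_ml(delta_a340_per_min, reaction_ml, lysate_ml, path_cm=1.0):
    """Activity in units (umol NADPH/min) per ml of lysate added to the cuvette.

    delta_a340_per_min / (extinction * path) is the rate in mM/min, which equals
    umol/min per ml of reaction; scaling by the reaction volume and dividing by
    the lysate volume gives units per ml of lysate.
    """
    rate_mm_per_min = delta_a340_per_min / (NADPH_EXTINCTION_MM * path_cm)
    return rate_mm_per_min * reaction_ml / lysate_ml


# Hypothetical reading: A340 falls by 0.0622 per minute in a 1.0 ml reaction
# containing 0.01 ml (10 ul) of cell lysate.
activity = units_per_ml(0.0622, reaction_ml=1.0, lysate_ml=0.01)
print(f"{activity:.2f} u/ml")
```

With these hypothetical inputs the lysate would contain about 1 unit per ml.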

As seen by the results generated in this example, the methods of the invention are widely
applicable to unrelated genes from both prokaryotes and eukaryotes, and result in improved
expression and enzymatic activity when expressed in a heterologous prokaryotic or eukaryotic
host cell.

The preceding description has been presented with references to presently preferred
embodiments of the invention. Persons skilled in the art and technology to which this invention
pertains will appreciate that alterations and changes in the described genes, proteins, and methods
can be practiced without meaningfully departing from the principle, spirit and scope of this
invention.

Accordingly, the foregoing description should not be read as pertaining only to the precise
genes, proteins, and methods described and shown in the accompanying drawings, but rather
should be read as consistent with and as support for the following claims, which are to have their
fullest and fairest scope.